LNCS 2917 






Model-Checking Based 
Data Retrieval 



An Application to Semistructured 
and Temporal Data 








Lecture Notes in Computer Science 2917 

Edited by G. Goos, J. Hartmanis, and J. van Leeuwen 




Springer 

Berlin 
Heidelberg 
New York 
Hong Kong 
London 
Milan 
Paris 
Tokyo 




Elisa Quintarelli 



Model-Checking Based 
Data Retrieval 



An Application to Semi structured 
and Temporal Data 




Springer 




Series Editors 



Gerhard Goos, Karlsruhe University, Germany 
Juris Hartmanis, Cornell University, NY, USA 
Jan van Leeuwen, Utrecht University, The Netherlands 

Author 

Elisa Quintarelli 

Politecnico di Milano 

Dip. di Elettronica e Informazione 

Piazza Leonardo da Vinci 32, 20133 Milano, Italy 

E-mail: quintare@elet.polimi.it 



Cataloging-in-Publication Data applied for 

A catalog record for this book is available from the Library of Congress. 

Bibliographic information published by Die Deutsche Bibliothek 

Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; 

detailed bibliographic data is available in the Internet at <http://dnb.ddb.de>. 



CR Subject Classification (1998): H.2, H.3, H.4 
ISSN 0302-9743 

ISBN 3-540-20971-9 Springer- Verlag Berlin Heidelberg New York 



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer- Verlag. Violations are 
liable for prosecution under the German Copyright Law. 

Springer- Verlag is a part of Springer Science+Business Media 

springeronline.com 

© Springer-Verlag Berlin Heidelberg 2004 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by Boiler Mediendesign 
Printed on acid-free paper SPIN: 10977132 06/3142 5 4 3 2 1 0 




To Alessia 




Foreword 



This thesis deals with the problems of characterizing the semantics of and 
assuring efficient execution for database query languages, where the database 
contains semistructured and time-varying information. This area of technol- 
ogy is of much interest and significance for databases and knowledge bases; 
it also presents many challenging research problems deserving an in-depth 
investigation. Thus, the topic of Elisa Quintarelli’s dissertation is well chosen 
and totally appropriate to the current research trends. 

In her thesis, Elisa addresses a number of related problems. However, her 
work and contributions concentrate on two main problems. The first is the 
definition of an effective graph-based approach to the formalization of query 
languages for semistructured and temporal information. In her approach, 
query execution is viewed as the process of matching the query graph with 
the database instance graph; therefore, query execution reduces to search- 
ing the database for subgraphs that are similar to the given query graph. 
The search for such matches can be supported through the computational 
process of bisimulation. This approach is used to define the semantics of sev- 
eral languages, including graphical languages, such as G-Log and GraphLog, 
semistructured information languages, such as Lorel, and temporal languages, 
such as TSS-QL. Both graph-based approaches and bisimulation had been 
used by previous authors for defining query languages and their semantics; 
however, this work goes well beyond previous approaches by integrating and 
refining these techniques into a flexible and powerful paradigm that Elisa 
demonstrates to be effective on a spectrum of languages and a suite of alter- 
native semantics. 

The second research challenge tackled by Elisa in her thesis is that of 
efficient implementation. This is a nontrivial problem since bisimulation can, 
in the worst case, incur an exponential time complexity. Her original solution 
to this difficult problem consists of modeling graphical queries as formulas of 
modal logic and interpreting database instance graphs as Kripke transition 
systems. In this way, the problem of solving graphical queries is reduced 
to the model-checking problem for which efficient decision algorithms exist. 
In particular, the thesis focuses on CTL formulae for which efficient model 
checkers are available; this allows Elisa to demonstrate the efficiency of her 
proposed approach with experimental results. 




VIII Foreword 



In conclusion, the thesis represents an interesting piece of research, char- 
acterized by novel contributions and an in-depth expertise on several topics. 
Indeed, the author brings together ideas and techniques from different ar- 
eas, and displays a solid expertise in computing, in general, and information 
systems, in particular. In my role as her advisor I had the privilege to see 
Elisa’s growth as a researcher; overall, I am still impressed with the depth and 
breadth of her work, the originality of her results, and her research maturity. 
Therefore, I am very happy to see that her Ph.D. thesis has been chosen to 
be published as a book in these Lecture Notes in Computer Science. 



Milan, 10/10/2003 



Letizia Tanca 




Preface 



This book contains the research covered by my Plr.D. Thesis, which was 
developed at the Dipartimento di Elettronica of the Politecnico di Milano, in 
collaboration with Prof. Agostino Dovier at the Dipartimento di Informatica 
of the Universita di Verona. 

The main topic of the book is the study of appropriate, flexible and ef- 
ficient search techniques for querying semistructured and WWW data, also 
taking into account the time dimension. 

This work was motivated by the fact that in recent years a lot of attention 
has been given by the database research community to the introduction of 
methods for representing and querying semistructured data. Roughly speak- 
ing, this term is used for data that have no absolute schema fixed in advance, 
and whose structure may be irregular or incomplete. 

A common example in which semistructured data arise is when data are 
stored in sources that do not impose a rigid structure, such as the World 
Wide Web, or when they are extracted from multiple heterogeneous sources. 
It is evident that an increasing amount of semistructured data is becoming 
available to users and, thus, there is a need for Web-enabled applications 
to access, query and process heterogeneous or semistructured information, 
flexibly dealing with variations in their structure. 

This work proposes a formalization to provide suitable graph-based se- 
mantics to languages for semistructured data, supporting both data structure 
variability and topological similarities between queries and document struc- 
tures. A suite of semantics based on the notion of bisimulation is introduced 
both at the concrete level (instances) and at the abstract level (schemata) 
for a graph-based query language, but the results are general enough to be 
adapted to other existing proposals as well. Moreover, complexity results 
on the matching techniques, which are required to find a sort of similarity 
between a graphical query and a semistructured database, have stimulated 
our interest in investigating alternative approaches to solve such graphical 
queries on databases. The main idea here is to solve the data retrieval prob- 
lem for semistructured data by using techniques and algorithms coming from 
the model-checking research field: experimental results are presented in this 
work to confirm the possibility of effectively applying the proposed method 
based on model-checking algorithms. 
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The book is structured in the following way: Chap. 1 describes the current 
scenario, the motivations and the contributions of this work. 

Chapter 2 sets the formal content, by presenting G-Log, a graph-based 
query language showing a three-level semantics based on the notion of bisim- 
ulation; the relationships between instances and schemata are investigated by 
using the theory of abstract interpretation, which provides a systematic ap- 
proach to guarantee the correctness of operating on schemata with respect to 
the corresponding concrete computations on instances. We chose G-Log be- 
cause it is a general graphical language that combines the expressive power of 
logic, the modeling power of objects, and the representation power of graphs 
for instances, schemata and logical inferences (i.e., queries and rules). The 
chapter also describes the main features of two well-known query languages, 
namely Graph-Log and UnQL, in order to show the applicability and gen- 
erality of the proposed study to other SQL-like and graphical languages for 
semistructured data. 

Chapter 3 describes the novel idea of this work. We propose an approach 
to associate a modal logic formula to a graphical query, and to interpret 
database instance graphs as Kripke Transition Systems ( KTS ). In this way, 
the problem of finding subgraphs of the database instance graph that match 
the query can be performed by using model- checking algorithms. In particu- 
lar, we identify a family of graph-based queries that represent CTL formulae. 
This is very natural and, as an immediate consequence, an effective procedure 
for efficiently querying semistructured databases can be directly implemented 
on a model-checker and the query retrieval activity can be performed in poly- 
nomial time. In Chap. 3 we focus on a graphical query language very similar 
to G-Log: we consider only the main constructs of this language that have 
a natural translation into temporal logic, but we emphasize the possibility 
of expressing universal conditions in a very compact way. In this chapter we 
also show some experimental results that confirm our proposal. 

Starting with the potential of the CTL language for time-based represen- 
tation, Chap. 4 extends standard techniques for modeling and accessing static 
information in order to model and retrieve temporal aspects of semistructured 
data, such as, for example, evolution on simple values contained in semistruc- 
tured documents. In particular we present a generic graphical data model 
suitable for representing both static and dynamic aspects of semistructured 
data, and we introduce a very simple SQL-like query language to compose 
temporal inferences. In this chapter we do not extend G-Log because it is 
more natural to express temporal conditions with SQL properties; however, 
we informally propose the possibility of encoding such properties into graph- 
ical queries by adapting the model originally introduced for G-Log in a very 
natural way. We show also the fragment of this SQL-like language that can be 
translated into the logic CTL , and thus we propose applying model-checking 
algorithms to solve temporal queries as well. In the same chapter we con- 
centrate our attention not only on the classical notion of the time concept, 
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XI 



as it is deeply known in the database field, but we further investigate the 
possibility of considering more than one dynamic aspect of semistructured 
data. In particular, we represent with the same data model valid time and 
interaction time. The first notion is used to consider the validity of facts in 
the represented reality, whereas the second one is related to user browsing 
history while navigating Web sites. 

Finally, in Chap. 5 we compare this work with others known in the liter- 
ature, while Chap. 6 summarizes the main results of the thesis and proposes 
further developments of our piece of research. 
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1. Introduction 



An increasing amount of data is becoming available electronically to humans 
and programs. Such data are managed via a large number of data models and 
access techniques, and may come from relational or object-oriented databases 
(. structured data), or consist of simple collections of text or image files (un- 
structured data). Many of them can be seen as semistructured : with the term 
semistructured data we intuitively refer to data whose schema is not fixed in 
advance, and whose structure may be irregular or incomplete. 

We can cite two most common contexts in which semistructured data 
arise: when data are stored in sources that do not strictly impose a rigid 
structure, such as the World Wide Web, and when data are combined from 
several heterogeneous, possibly structured, data sources with different struc- 
tures. 

As a consequence of the large volume and nature of data available to 
the casual users and programs, studying flexible and efficient techniques for 
automatically extracting, browsing or querying semistructured data, and in 
particular data from Web pages, remains a complex and important task in 
the database community. 

In this book we base our work on the use of graph-based representations 
for semistructured data and queries, and address two specific problems in this 
data context: the first concerns the study and definition of a suite of semantics 
for the graph based query language, that allows query composers to relax or 
restrict topological requirements at their choice. The second main problem 
addressed is the usage of model-checking based techniques to efficiently solve 
graphical queries on static and dynamic semistructured documents. 



1.1 Motivations 

With this work we would like to bring some innovative contributions to 
semistructured and temporal data management. In particular, we concen- 
trate on a general graph-based data model which offers strong expressive 
and modeling power, and on a user friendly graphical language. We deeply 
study a method to add flexibility to the data retrieval activity when static 
and temporal queries are naturally expressed by labeled graphs. Moreover, 
we consider also complexity results on the required procedures and therefore 

E. Quintarelli: Model-Checking Based Data Retrieval, LNCS 2917, pp. 1-8, 2004. 
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focus on a technique based on model-checking algorithms to efficiently solve 
a subset of the queries. 

In the recent years a number of research projects have addressed the 
problem of accessing in a uniform way semistructured information often rep- 
resented by using data models based on directed labeled graphs. Among these, 
we can cite LOREL [66], UnQL [12], WG-Log [32], WebSQL [67], WebOQL [7] 
and, StruQL [50]. 

Intuitively, the term semistructured data refers to data with (some of) 
the following features: 

— the schema is not given in advance and may be implicit in the data; 

— the schema is relatively large w.r.t. the size of the data and may change 
frequently; 

— the schema is descriptive rather than prescriptive, i.e. it describes the cur- 
rent state of the data, but violations of the schema are still tolerated; 

— the data are not strongly typed, i.e. for different objects, the values of the 
same attribute may be of different types. 

Effectiveness and efficiency are mandatory requirements when accessing 
this kind of information; therefore, also in this case appropriate search tech- 
niques are more than necessary. Pure keyword-based search techniques have 
often proved to be ineffective, since in that setting only the document lexicon 
is taken into account, while the intrinsic semantics conveyed by the document 
structure is lost; in practice, often this leads to the retrieval of too many doc- 
uments, since also the ones that do not share the required structure are often 
included into the result. It is natural to study techniques that go beyond the 
basic information retrieval paradigm supported by most search engines, and 
consider also the (partial) structure of documents. 

The claim of this work is that, in order to take full advantage of the 
document structure in the semistructured setting, its hierarchical (or, more 
generally, topological) organization should be somehow exploited, by intro- 
ducing some notion of query like the one used in database systems, being still 
aware of the fact that the document structure is far from being as strict as in 
the usual database context. In particular, we focus on flexible and efficient 
methods for searching static and dynamic information on the Web. 

Indeed, efficient searching techniques can be naturally extended in order 
to verify not only static properties but also changes in semistructured data. 
In the past years temporal database management was a productive field of 
research, that is now covered by textbooks (see for example [48, 3, 99]) and 
applied in various contexts (see for example the RapidBase system [97]). Sev- 
eral data models, which usually extended the relational one [29, 74, 86], were 
proposed in order to consider the time-varying nature of data; however, in 
the semistructured database context there are few works on this research 
direction, although the problem of representing changes appears more stim- 
ulating in this setting for the irregularity, incompleteness and lack of schema 
that often characterize semistructured data. Moreover, we think that it is 
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worthwhile to investigate the dynamic aspects of this kind of data that arise 
from different notions of the time concept. In particular: 

— Valid Time, the valid time (VT) of a fact is the time when the fact is 
true in the modeled reality [57]. 

— Transaction Time, the transaction time (TT) of a fact is the time when 
the fact is current in the database and may be retrieved [57]. 

As a matter of fact, if we restrict our study on semistructured data that 
arise in the World Wide Web context, we can easily observe that the concept 
of time has a more general meaning w.r.t. the classical notion used in temporal 
databases, and does not include only the natural evolution of data through 
time. 

In this work we also investigate another time dimension, strictly related to 
web data: Interaction Time is valid time relative to user browsing. When 
a user browses through a document (for example a hypermedia representa- 
tion) they choose a particular path in the graph representing semistructured 
information and in this way they define their visit order between objects. 
By appropriately collecting and querying such information, we can create a 
view depending on each user’s choices, that can be useful to personalize data 
presentation. 

We think that temporal aspects of semistructured data are related to 
the main issues that the research community studying techniques to adapt 
applications to users is now considering. Indeed, in order to customize the 
presentation of information it is therefore very important to capture the his- 
tory of interactions between users and the applications they browse. 

These considerations have motivated our work that studies a novel and 
efficient method based on model-checking techniques for automatically ex- 
tracting semistructured and time-based data, in particular w.r.t. Valid Time 
and Interaction Time. 



1.2 Overview of the Book 

Semistructured data are often represented by labeled graphs, a data model 
which can be made less rigid than the tabular one: note that the web itself 
may be seen as a large, graplr-like database. In addition, a lot of recent 
proposals have concentrated on issues concerning techniques for introducing 
flexible and expressive query languages. 

This book contains several contributions toward this end. In Chapter 2 we 
formalize a method to provide suitable graph-based semantics to languages 
for semistructured data, supporting both data structure variability and dif- 
ferent topological similarities between queries and document structures. 

We start by observing that the data retrieval activity requires the devel- 
opment of graph algorithms, since queries (graphical or not) are expected to 
extract information stored in labeled graphs. 
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In order to do that, it is required to perform a kind of matching of the 
query graph with the database instance graph. More in detail, we need to 
find subgraphs of the instance of the database that match (e.g. they are 
isomorphic or somehow similar to) the query graph. 

Our approach to meet the effectiveness and efficiency requirements in 
querying semistructured data is to make the required topological similarity 
flexible, in order to support different similarity levels. Therefore, one aim 
of the chapter is to illustrate effective techniques that allow the query com- 
poser to relax or restrict topological requirements at their choice. One main 
contribution is the design of a suite of semantics for the graphical language 
G-Log [81], based on the notion of bisimulation [70] (see also [12, 37, 64]), 
given both at the instance and at the schema level. In particular, we discuss 
in full detail the benefits of tuning the semantics by enforcing or relaxing 
requirements of the bisimulation relation. 

We choose G-Log because it is a general graphical language which com- 
bines the expressive power of logic, the modeling power of objects, and the 
representation power of graphs. It was initially conceived as a deductive lan- 
guage for complex objects with identity , since its data model captures most of 
the modeling capabilities of the object-oriented query languages (i.e. struc- 
tural and semantic complexity) but lacks the data abstraction features, also 
typical of the object-oriented languages and databases, that are more related 
to the dynamic aspects of the system. Moreover, in G-Log the representation 
of logical inference is also graphical. 

Our work started as a part of the WG-Log project [32], which addressed 
the problem of Web information querying by enriching G-Log [83, 81] with 
constructs typical of hypermedia, like entry points, indexes, navigational 
edges, etc. A subsequent project [18, 31, 42], still in the area of semistructured 
information querying, addressed the problem of querying XML-specified in- 
formation, and still investigated the possibilities of flexible graph-based query 
formulation. To this aim, in [78] the XML-GL language is translated into G- 
Log, in order to take advantage of the parametric semantics defined here. 
The results presented here for G-Log can thus be easily extended to WG-Log 
and XML-GL as well. 

As a typical situation with semistructured information is that the data- 
bases lack an a-priori schema, we used the theory of abstract interpretation 
to derive a common schema for a given (set of) document bases, and keep 
trace of changes made to the underlying instances. Thus, as schemata can 
evolve gracefully with the evolution of their instances in the extended setting 
of WG-Log and XML-GL, and more in general in the graph-based languages 
similar to G-Log, this will allow us to trace the evolution of documents and 
Web pages by keeping trace of the history of their DTD’s or schemata. This 
possibility to easily keep trace of schemata evolution is useful to apply queries 
for finding out information about the structure of Web sites. 
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The suite of semantics defined for G-Log, all based on the bisimulation 
concept, allows us to support different similarity levels, but in general is not 
applicable to real-life size problems, because even if the problem of establish- 
ing whether two graphs are bisimilar or not is polynomial in time [60, 79], 
the task of finding subgraphs isomorphic or bisimilar to a given one is NP- 
complete [44]. 

At this point, our main problem consists in finding algorithms to efficiently 
implement the data retrieval activity for semistructured data. In order to 
overcome this limiting but crucial issue, we observe that graphical queries 
can be easily translated into logic formulae. In fact it is possible to find 
many attempts to give a graph-based representation of first-order logic or to 
translate graphical queries into logical formulae (see for example [81, 33, 10, 
61]). In general, techniques for translating graphs in formulae have also been 
exploited in non-database related literature [14]. 

The novel idea of this work is to associate a modal logic formula T' to a 
graphical query, and to interpret database instance graphs as Kripke Transi- 
tion Systems (KTS). In this way we reduce the problem of solving a graphical 
query w.r.t. a semistructured database to an instance of the model- checking 
problem. 

Model-checking [27, 73, 28] has emerged in the last decade as a promising 
and powerful approach to the automatic verification of systems. Intuitively, 
a model-checker is a procedure that decides whether a given structure M is 
a model of a logical formula p. More in detail, M is an (abstract) model 
of a system, typically a finite automata-like structure, and p is a modal or 
temporal logic formula describing a property of interest. 

In Chapter 3, we use a modal logic with the same syntax as the temporal 
logic CTL. In this way, finding subgraphs of the database instance graph that 
match the query can be performed by finding nodes of the KTS derived from 
the database instance graph that satisfy the formula . This is an instance 
of the model- checking problem, and it is well-known that if the formula T 
belongs to the class CTL of formulae, then the problem is decidable and 
algorithms running in linear time on both the sizes of the KTS and the 
formula can be employed [27] . 

In this work we identify a family of graph-based queries that represent 
CTL formulae and discuss the expressive limits of this propositional temporal 
logic used as a query language. Observe that in this first setting, the notion 
of different instants of time represents the number of links the user needs to 
follow to reach the information of interest. 

Our translation from graphical queries to modal formulae is very natural 
and, as immediate consequence, an effective procedure for efficiently querying 
semistructured databases can be directly implemented on a model checker. 
We use a “toy” query language called W mainly based on the G-Log language: 
we do not consider some details of G-Log (e.g. the possibility to generate new 
relations in a given database by means of rules) but add the possibility to 
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express universal conditions in a graph-based setting. We can say that W 
is a representative of several approaches in which queries are graphical or 
can be easily seen as graphical (cf. e.g. Lorel [4], G-Log [81], GraplrLog [33], 
and UnQL [50]). We will relate W to UnQL, GraplrLog and G-Log and show 
the applicability of the method for implementing (parts of) these languages. 
The model-checking technique for data retrieval could also be applied to nou- 
graplrical query languages: formula extraction from SQL-like queries can be 
done by using similar ideas. 

We have effectively tested the approach by using the model-checker 
NuSMV [26] to confirm that our technique can be really applied. 

It is known that model-checking has been used in the past years as a 
powerful approach to the automatic verification of systems, and is especially 
applied to verify temporal properties on reactive systems [49] . 

Leaving from the potential of the CTL language for time-based represen- 
tation, in Chapter 4 we adapt the technique introduced for accessing static 
information stored in graphical databases to retrieve temporal aspects of 
semistructured data as well. 

Most applications of database technology are temporal in nature. In fact, 
in many applications, information about the history of data and their dy- 
namic aspects are just as important as static information. Consider, for ex- 
ample, records of various kinds: financial, personnel, medical and scheduling 
applications such as airline, train and hotel reservations. 

These kinds of applications rely on temporal databases, which record time- 
referenced data. The main concepts of classical temporal databases can there- 
fore be applied and adapted to semistructured databases, which are usually 
represented by means of labeled graphs instead of flat tables. Thus, our first 
step for studying temporal aspects of semistructured data is to extend graph- 
based data models to keep trace of historical information. In order to repre- 
sent dynamic aspects of this kind of data and to allow queries about their 
evolution through time, we chose to add to the graph the information about 
the time interval of objects and relations. This kind of information repre- 
sents the valid time of objects and relations, i.e. the time when the objects 
or relations are true in the modeled reality [57]. 

In particular, in this work we analyze the possible contexts where semi- 
structured applications arise, and consider several dynamic aspects of semi- 
structured data representation, based on different uses of the concept of “time 
interval” . 

We first define a graphical model allowing to represent and retrieve tem- 
poral information about semistructured data according to the concepts of 
Valid Time and Interaction Time. 

The relevance of Valid Time is widely accepted, since, as in the classical 
database field ([90], [25], [24]), also in the context of semistructured data it 
is interesting to take into account the dynamic aspects of data. 
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Interaction Time is a somehow novel field of discussion, as nowadays the 
explosive growth of information sources available on the World Wide Web, 
has made necessary for users to utilize automated tools to find quickly the 
desired information resources, and for site developers to track and analyze 
their usage patterns. These factors are deeply studied in a research area called 
Web Mining [34] that can be broadly defined as the automatic discovery and 
analysis of useful information from the World Wide Web. 

In this work we show that a general temporal model can be adapted to 
keep trace of usage patterns while navigating web sites. The analysis of how 
users are accessing a site is critical for optimizing the structure of the Web 
site and can be placed among the pre-processing activities required for mining 
before the mining algorithms can be run [34] . Our proposals is thus to use a 
database approach for storing sequences of page references before any mining 
is done. 

Results on such historical analysis can be successively used for improving 
and customizing site content presentation: in fact the graph-based temporal 
representation is used not only to keep track of user’s preferences, but also 
to pose all sorts of different queries by means of a SQL-like query language 
based on path expressions, called TSS-QL [77]. 

This language is not graphical because in our opinion the addition of 
the temporal dimension makes it difficult to directly express generic tempo- 
ral properties in a graph-based form (e.g. with G-Log graphs); however, a 
graphical version has also been designed, which is an extension of the G-Log 
language. This encoding allows to reduce the problem of solving a query on 
a graph-based model to the problem of performing a matching of the “query 
graph” with the “site graph”. Again, in order to overcome complexity limits 
of algorithms for solving the matching problem between subgraphs, we find 
the fragment of our language that can be correctly translated into the logic 
CTL for applying model-checking techniques also to solve temporal TSS-QL 
queries. 



1.3 Contributions 

This book investigates static and dynamic properties of semistructured data 
by concentrating on a new perspective to efficiently solve queries on such 
kind of data. In particular, this work is a proposal to deal with the necessity 
to add flexibility and efficiency to methods for solving graphical queries by 
considering approaches in quite different research areas, going from standard 
database techniques to temporal databases, Web Mining [9, 23, 34, 100] and 
Model-Checking [49]. 

We differ from other works in the literature in several respects: 

1. We study a search technique that overcomes the basic information re- 
trieval paradigm supported by most search engines and consider also the 
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structure of documents, to use for adding flexibility to query methods. 
In particular, we show that by considering different notions of similar- 
ity between query graphs and instance graphs it is possible to relax or 
restrict the set of retrieved information. We also formalize the relation- 
ship between semistructured instances and their schemata by the appli- 
cation of the theory of Abstract Interpretation; this approach allows us 
to emphasize the ability though schemata whenever it is useful to infer 
information about the structure in order to query databases modeled by 
labeled graphs. 

2. In the semistructured database context, we extend a general model based 
on labeled graphs in order to store changes to data. The model is general 
enough to be applied in different contexts; in fact we consider more than 
one dynamic aspect of semistructured data, and by the same model we 
can store information either on the natural evolution of data through 
time or on user activities while navigating Web sites. 

3. This work is also an attempt to formalize the relationships between the 
data retrieval activity in the semistructured context and the model- 
checking problem. This well-known technique for verifying temporal 
properties on Kripke Transition Systems is an efficient procedure to solve 
(graphical) queries on semistructured databases. In particular, we show 
how to translate queries into CTL formulae and graphical instances into 
Kripke Transition Systems; some experimental results are reported to 
confirm the feasibility of this technique. We explain the main restrictions 
of the proposed approach and the expressive power limits of propositional 
temporal logics used as query languages. 

To sum up, we may say that ours is a very comprehensive approach to 
problems related to modeling and querying semistructured data, more global 
than most of the ones present in the literature, and benefit from the advan- 
tages of techniques originally designed in different research areas (Databases, 
Logics, Web Mining and Model-Checking) to deal with the need of flexible 
and efficient search methods for retrieving static as well as dynamic informa- 
tion. 
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Computational models based on graph transformation are used as semantic 
domains for various kinds of formalisms and languages like, for example, 
actor systems, the 7r-calculus, functional programming, database query and 
manipulation languages, and neural networks. 

Also semistructured data are often represented by using data models 
based on directed labeled graphs: among these we can cite, UnQL [12], 
LOREL [66], WebSQL [67], WebOQL [7], StruQL [50], GraphLog [33] and 
G-Log [81]. Effectiveness and efficiency are mandatory requirements when 
accessing this kind of information; therefore, appropriate search techniques 
are more than necessary and a lot of work has been done to face the problem 
of accessing in a uniform way this kind of data with graph-based queries. In 
some approaches queries are really graphs [50, 33, 81] while, in others, queries 
can be written in extended SQL languages [12, 4, 16]. 

In this work, we mainly refer to the graphical representation and query 
style of G-Log, a well-known database language for complex objects [83, 81]. 
The reason of the choice of G-Log stands on observing that most of the 
models and languages for representing and querying semistructured informa- 
tion share an analytical approach to data representation, lacking a synthetic 
notion of schema. Conversely, G-Log models semistructured information by 
using a concept very close to that of database schema, that in this context 
enables the user to formulate a query in an easier way. Nevertheless, the use 
of a schema-like facility, however desirable, should not be mandatory, since 
we may well imagine a situation where the user is not fully aware of the 
document’s exact organization. In this case, assuming a strict matching be- 
tween the document and the required topological structure may lead to miss 
some still interesting documents that do not adhere precisely to the query 
structure. 

Our approach to attack these problems is to make the required topological 
similarity flexible, in order to support different similarity levels. Therefore, 
the aim of this chapter is to illustrate effective techniques that allow the 
query formulator to relax or restrict topological requirements at their choice. 
Its main contribution is the design of a suite of semantics for G-Log, based 
on the notion of bisimulation [70] (see also [64, 12]), given both at the in- 
stance and at the schema level. In particular, we discuss in full detail the 
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benefits of tuning the semantics by enforcing or relaxing requirements on the 
bisimulation relation. 

The relationship between instances and schemata is investigated using the 
theory of Abstract Interpretation [38] , which provides a systematic approach 
to guarantee the correctness of operating on schemata with respect to the 
corresponding concrete computations on instances. 

The revisitation of the semantics of G-Log also clarifies some subtle ambi- 
guities in the initial semantics of G-Log queries. Since the semantics is based 
on the notion of bisimulation, the implementation of the language will in- 
herit all the algorithmic properties studied in the literature 1 . In particular, 
Kanellakis and Smolka in [60] relate the bisimulation problem with the gen- 
eral (relational) coarsest partition problem, and they propose an algorithmic 
solution and pointed out that the partition refinement algorithms in [79, 45] 
can serve, and more efficiently, to the same task. Applicability of our ap- 
proach is strongly based on this efficient implementation of the bisimulation 
tests. 

Alternative approaches to the semantics of graphical languages have been 
introduced in the literature. For instance, the semantics of the language 
GraphLog is given via rewriting into DATALOG. Our choice of giving di- 
rectly a graph-based semantics is not only justified by the fact that this is a 
typical approach for visual languages, but also, and more significantly, by the 
fact that the expressive power of G-Log is higher than that of DATALOG. 

Our approach may remind previous works on the two languages UnQL [12] 
and GraphLog [33] ; differences between G-Log and them are mainly related to 
the expressive power: for instance, G-Log allows to express cyclic information 
and queries, and achieves its high expressive power by allowing a fully user- 
controlled non-determinism. Addition of entities, other than the addition of 
relations allowed in GraphLog, is admitted in G-Log which, in its full form, 
is Turing complete [81]. 

This issue is further dealt with in Chapter 5, where it becomes clear that 
our results can be extended to these languages as well, and in general to 
any graphical language whose main aim is to query and transform a graplr- 
orientecl data model by using graph-based rules. 

In this chapter, Section 2.1 introduces the language G-Log and Section 2.2 
explains the (concrete) semantics of the language, showing the three-level 
semantics which introduces flexibility. In Section 2.3 some results for the 



1 Bisimulation is usually attributed to Park, who introduced the term in [82], 
extending a previous notion of automata simulation by Milner ([68]). Milner em- 
ploys bisimulation as the core for establishing observational equivalence of the 
Calculus of Communicating Systems (CCS) ([69, 70]). In the Modal Logic/Model 
Checking areas this notion was introduced by van Benthem (cf. [94]) as an equiv- 
alence principle between Kripke Structures. In Set Theory, it was introduced as 
a natural principle replacing extensionality in the context of non well-founded 
sets [5]. 
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semantics proposed are given in detail; here different types of rules are an- 
alyzed and the differences between the three semantics are highlighted. In 
Section 2.4 the notion of abstract graphs (corresponding to schemata) is in- 
troduced, and the concepts of abstract interpretation are applied. In some 
cases query applicability can be tested directly on schemata; this means that 
they represent instances correctly. Moreover, operating on schemata is useful 
whenever one wants to investigate on general properties on the structure of a 
given (semistructured) database. This section also shows how schemata can 
be derived by abstraction (in n log n time) from instance sets, thus allowing to 
deduce a common scheme or DTD from a set of documents. As schemata can 
evolve gracefully with the evolution of their instances (applying the theory of 
abstract interpretation) , in the extended setting of WG-Log [32] and XML- 
GL [18], and more in general in the graph-based languages similar to G-Log, 
this will allow to trace the evolution of documents and Web pages by keeping 
trace of the history of their DTD’s or schemata. Section 2.5 introduces a log- 
ical and model theoretic view of G-Log graphs and the relationships with the 
concrete semantics are analyzed. In Section 2.6 we point out the relationships 
between our three-levels semantics and the G-Log original semantics defined 
in [81]. To conclude, Section 2.9 presents UnQL and GraphLog, two query 
languages for semistructured databases that we will consider in this work to 
show the applicability of the model-checking based data retrieval approach 
to both SQL-like and graphical query languages. 



2.1 G-Log: a Language for Semistructured Data 

2.1.1 An Informal Presentation 

In this section we introduce some intuitive examples of queries in the language 
G-Log, in order to appreciate its expressive power and to emphasize some 
ambiguities we are going to tackle later on. 

Consider the graph depicted in Fig. 2.1 (a). It represents the query ‘collect 
all the people that are fathers of someone’. Intuitively, the boldface part of the 
graph (also called the ‘green part’) is what you try to get from the database, 
while you match the rest of the graph (also called the ‘red part’) with a graph 
representing the database instance. 

The query ( b ) of Fig. 2.1 can be read as ‘collect all the workers having 
(at least) one son that works in some town’. 

Also negative requirements can be introduced in a query by means of 
dashed edges and nodes. This is depicted by query (c) of Fig. 2.1 whose 
meaning is ‘collect all the workers having (at least) one son that works in a 
town different from that where his father works’. Moreover, the meaning of 
query (d) is ‘collect all the people who are not father of someone’. 

The translation of queries (a), ( b ), (c), ( d ) into logical formulas is also il- 
lustrated in Fig. 2.1 (with abbreviations for predicate symbols). As observed 
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Figure 2.1. Sample queries 



in [81], G-Log offers the expressive power of logic, the modeling power of 
object-oriented DBs, and the representation power of graphs. 

However, the modeling power of G-Log is heavily constrained by some 
arguable choices in its semantics [81]. Consider, for instance, query (e) of 
Fig. 2.2: it can be intuitively interpreted in three different ways: 

— collect the people having two children, not necessarily distinct; 

— collect the people having exactly two (distinct) children; 

— collect the people having at least two (distinct) children. 

The semantics of G-Log as given in [81] uniquely selects the first option. As 
a consequence, queries (a) and (e) become equivalent, so there is no way to 
express ‘collect the people that have more than one child’ without making 
use of negative information (negated equality edges in G-Log [81]). 

An even deeper problem arises when considering query (/): in G-Log it 
has exactly the same meaning as query (6). In other words, it is not possible 
to express a query like ‘collect the people that work in the same town as (at 
least) one of their children’ in a natural fashion. Actually, such a query can 
be expressed in G-Log, but not in a straightforward way. Of course, these 
problems are further emphasized when combined with negation. 

In order to address this kind of ambiguities, in the Section 2.2 we revisit 
the semantics of G-Log taking advantage of the use of the well-known concept 
of bisimulation. Furthermore, we apply the abstract interpretation approach 
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to the semantics defined in this way, in order to clarify the relationship be- 
tween concrete (instances) and abstract (schemata) data representations. 

2.1.2 Syntax of G-Log 

In this section we introduce the basic aspects of the syntax of the G-Log lan- 
guage. Definitions are based on the concept of directed labeled graph, and, 
differently from [81, 83], rules, programs, and queries are defined indepen- 
dently of the context in which they are used. This simplifies the notation and 
allows the study of algebraic properties of programs. However, the semantics 
(cf. Section 2.2) will be given in such a way that the practical use is coherent 
with that of [81, 83]. 

Definition 2.1.1. A G-Log graph is a directed labeled graph (N, E,£), where 
N is a (finite) set of nodes, E is a set of labeled edges of the form (m, label, n), 
where m,n € N and label is a pair of C x (£ U {_L}), while £ : N — > 
(T U {_L}) xCx (£U{_L}) x (<SU{_L}). _L means ‘undefined’, and: 

— T = { entity, slot } is a set of types for nodes; 

— C = { red, green, black } is a set of colors; 

— £ is a set of labels to be used as entity, slot, and relation names; 

— S is a set of strings to be used as concrete values. 

£ is the composition of four single-valued functions £t,£c,£c,£s- When the 
context is clear, if e = ( m,(c,k),n ), with abuse of notation we say that 
fc(e) = c and £p(e) — k. Moreover, we require that: 

— (Vx € N)(£t(x) slot — » £s(x) = _L) (i.e., values are associated to slot 
nodes only), 

— (V(m, label, n) £ E)(£r{jn) ^ slot ) (i.e., slot nodes are leaves). 

Observe that two nodes may be connected by more than one edge, pro- 
vided that edge labels be different. 





XX {x ■ 3y 1 y 2 p( £c ) A f( x , yi) A p(y\)A ( {x : 3y 1 y 2 p(x) A f (x , A p(yi) A 

f(x,y 2 )Ap(y 2 )} V ' w(x, y 2 ) A w(y 1 , y 2 ) A t(y 2 )} 



Figure 2.2. Problematical queries 
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Red (RS) and black edges and nodes are graphically represented by thin 
lines (this does not originate confusion, since there can not be red and black 
nodes and edges in the same graph), while green (GS) by thick lines. Red 
and green nodes are used together in queries. 

Colors red and green are chosen to remind traffic lights. Red edges and 
nodes add constraints to a query (= stop!), while green nodes and edges 
correspond to the data we wish to derive (= walk!). 

Result nodes play a particular role in queries: they have the intuitive 
meaning of requiring the collection of all objects fulfilling a particular prop- 
erty. Moreover, result nodes can occur in (instances of) web-like databases to 
simulate web pages connecting links. In this paper, if an entity node is labeled 
by result 2 it will be simply represented by small squares, and its outcoming 
edges implicitly labeled by ‘connects’. In general, an entity (slot) node n will 
be represented by a rectangle (oval) containing the label £c(n). 

As an instance, consider the graph (e) of Fig. 2.2. Let 1 be the topmost 
node, 2 the center node, 3 the leftmost, and 4 the rightmost node. Then 

G = { N = { 1,2, 3, 4}, 

E = { (1, (GS, connects), 2) , 

(2, (RS, father), 3), 

(2, (GS, father), A)}, 

£ = { 1 i— > (entity, GS, result, _L), 

2 i— > (entity, RS, Person, _L), 

3 i— > (entity, RS, Person, _L), 

4 i— > (entity, RS, Person, _L)} ) 

Definition 2.1.2. Let G = ( N,E,l ) and G' = (N',E',£') be G-Log graphs. 
We say that G is a labeled subgraph of G' , denoted G G G' , if N C N' , 
E C E' , and l = £'\n (i.e., for all x € N it holds that £(x) = £'(x)). 

With s we denote the (empty) G-Log graph (0,0,0). It is immediate to 
see that given a G-Log graph G, then 

({G' is a G-Log graph : G 1 G G},G) 

is a complete lattice, where T = G, _L = e. Moreover, given two G-Log graphs 
G\ = (N\,Ei,ti) G G and G 2 = (N 2 ,E 2 ,£ 2 ) G G, where £\\n 2 = their 
l.u.b. and g.l.b. can be computed as 3 

G\ U G 2 = (N\ U N 2 , Ei U E 2 , t\ U if) 

G\ n g 2 = (Ni n n 2 ,E\ n e 2 ,£i r\i 2 ) 

2 [41] uses entry point nodes for this purpose. 

3 As a side remark, notice that, if G is the (complete) graph (N, N x {±} x N, l) and 
n = \N\, then the lattice contains: Y(n=o (7)2* = 0(n2 n ) subgraphs. If G is not 
of this form, it is difficult to find the exact number; however, if \E\ = 0(|A| 2 ), 
then the upper bound remains the same as the complete case. 
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We recall that a complete lattice is a non-empty ordered set P such that 
the the least upper bound and the greatest lower bound exist for each subset 
S of P. 

Definition 2.1.3. Given a G-Log graph G = ( N , E, £), and a set C of colors, 
C C C, consider the sets 

N' = N (~l £^ 1 (C) and E' = {(m, (c, k),n) £ E : c £ Cj. 

The restriction of G to the colors in C, denoted G\c is defined as G\c = 
(N',E',e\ N ,). 

Observe that G\c is not necessarily a graph, since, for instance, it may 
contain only edges and no nodes. 

Now we introduce the notions of concrete graph, rule, and program. 
Through them, we aim at characterizing the instances of a semi-structured 
database like a WWW site. 

Definition 2.1.4. A G-Log concrete graph is a G-Log graph such that: 

1. ( \/x £ N U E) (£c (x) = black), and 

2. (Vn £ N)(£q-(n) = slot^> £s{n) ^ X). 

With Q c we denote the set of G-Log concrete graphs. 

Definition 2.1.5. A G-Log rule R = ( N,E ,£ ) is a G-Log graph such that: 

1. (Vx £ N U E) {tc ( x ) 7^ black), 

2 - -R|{GS} 0. 

3. (Ve = (m, label, n) £ E)(£c(e) = RS — > fc(n) = RS), and 

4 ■ (Vn, n! £ N)(fi c (n) = GS A £ c (n') = RS) (£ c (n) # £ c {n') V £ t (ti) # 
£r{n'))). 

The third condition is necessary to avoid the possibility of dangling red 
solid edges, while the fourth condition is introduced to avoid the possibility 
of infinite generation of nodes (cf. Remark 2. 3.1). 4 Notice that it can be the 
case that = 0. This corresponds to an unconditional query or update. 

In general, we can split the notion of rule in two concepts: query and update. 
Basically, queries are expected to extract information from a graph (i.e., no 
existing class of objects is modified), whereas updates are expected to build 
up new instances of the graph (i.e., classes and relationships can be modified). 

Remark 2.1.1. Observe that no restriction is imposed on the structure of the 
graphs. Graphs can contain cycles of any kind. 



4 This corresponds to the well-known problem of OID generation in logical object 
oriented DB’s [1]. 
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Definition 2.1.6. Given a G-Log rule R = ( N,E ,£ ) and a graph G = 
(N' , E' , £’), The rule R is a query with respect to G if the following con- 
ditions hold: 

1. (Vn G N){tc{n) = GS -► (Vn' G JV')(4c(n) 7^ 4c(«') V 4r(n) ^ 4r(«')))> 

2. (Ve G E)(£c(e) = GS — > (Ve' G B / )(^(e) ^ ^(e 7 ))). 

TTie rwfe 1? is an update with respect to G if it is not a query. 

As a matter of fact, this formal notion does not correspond exactly to 
the common usage of the word “query”: we further distinguish two kinds 
of queries: generative queries retrieve some objects and relationships and, 
based on them, construct new concepts. For instance, from the notion of 
parent, a generative query (the transitive closure) can construct the notion 
of ancestor. Pure retrieval queries associate links to a number of objects 
enjoining a common property. The last notion captures the common intuition 
of query. The previous one is more related to the usual notion of view. 

A computation can be seen as a sequence of applications of a sequence of 
rules to a graph. This leads to the following definition of program. 

Definition 2.1.7. A G-Log program is a finite list of sets of G-Log rules. 

G-Log allows the expression of complex queries by means of programs. 
Sometimes it is worthwhile to have the possibility of expressing transitive 
properties: in G-Log a set of two rules is enough. 





(Hi) 

Figure 2.3. A G-Log program 



(h 2 ) 



For instance, the program of Fig. 2.3 expresses the following transitive 
property: ‘if two students attend the same course, they are schoolfellows. 
And, if a student a: is a schoolfellow of a student y and y is a schoolfellow of 
a student 2 then x is z’s schoolfellow’. 

Now we revisit the semantics of G-Log by using the bisimulation relation 
to define the concept of similarity between the query graph and the database 
instance graph. 
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2.2 Bisimulation Semantics of G-Log 

We present a three-level semantics, based on the concept of bisimulation. 
In Section 2.6 we compare this approach with that based on the concept of 
embedding of a graph used in [81]. 

First, let us remind the well-known concept of bisimulation [82, 70] 
adapted to our setting: 

Definition 2.2.1. Given two G-Log graphs Go = (N 0 ,E 0 ,(°) and G\ = 
(Ni, Ei, f 1 ), a relation b C No x N\ is said to be a bisimulation between Go 
and G\ if and only if: 

1. for i = 0,1, (Vrii G Ni)(3ni-i G fVi_i)n 0 b n\, 

2. (Vn 0 G A r 0 )(Vni G Ni)(n 0 b n\ — > t^-(n 0 ) = A ^(n 0 ) = A 

t°s( n o) = i)) (where = means that if both labels are defined — i.e., 

different from _L -they must be equal), and 

3. for i = 0,1, (Vn G Nf), let 

Mi(n ) —def {(to, label) : (n, (color, label), m) G E{\. 

Then, (Vno G fVo)(Vni G N\) such that no b n\, for i = 0, 1 it holds that 

( y(mi,li ) G M i (n i ))(3(m 1 -i,i 1 -i) G 
(to 0 b TOi A = ti_j) . 

We write Go ~ Gi (Go ^ G\) if b is (not) a bisimulation between Go and 
G\. We write Go ~ Gi (Go / Gi) i/ t/iere is (hot,) a bisimulation between 
Go and G\: in this case we also say that Go is bisimilar to G\. 

A brief explanation of the conditions above may be useful. Condition 1 is 
obvious: no node in the two graphs can be left out of the relation b. Condition 
2 states that two nodes belonging to relation b have same type and same 
label, exactly. Moreover, if they are just slots, then either one of their values 
is undefined, or they have also the same value. Finally, condition 3 deals with 
edge correspondences. If two nodes no,n\ are in relation b, then every edge 
having no as endpoint should find as a counterpart a corresponding edge with 
n± as endpoint. 

As an example, consider the graphs in Fig. 2.4: 

— Go ~ Gi ~ Gi, as well as the reflexive, symmetric and transitive closure 
of this fact, since ~ is an equivalence relation (cf. Lemma 2.2.1); 

— Go / Go since it is impossible to ‘bind’ a node labeled by P with one 
labeled by Q. Thus, condition (2) cannot be verified; 

— Go / G4 since it is impossible to verify condition (3). 

5 In this chapter we do not face the problem of negation (dashed nodes and edges). 
A line for future work is drawn in Section 2.7. 
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Figure 2.4. Bisimilar and not bisimilar graphs 




(G 3 ) (G 4 ) 



Notice that colors (represented by the lc function) are not taken into 
account in the bisimulation definition, while the value fields of the label are 
considered only when they are defined. The last feature will allow to apply 
bisimulations between schemata and instances (see Section 2.4). 

The bisimulation relation can be further refined by introducing additional 
conditions: 

Definition 2.2.2. Given two G-Log graphs Go = (N 0 ,E 0 ,( 0 ) and G \ = 
(N\,Ei,£i), and b C Nq x Ni we say that 

— b is a directional bisimulation, denoted by Go ~ Gi, if Go ~ Gi and b is 
a function from No to Ni. We say that Go~Gi if there is a b such that 

G b f , 

0 ~ Gi . 

— b is a bidirectional bisimulation, denoted by Go = G\, if Go ~ Gi and b 
is a bijective function from No to N\ . We say that Go = G\ if there is a b 

such that Go = Gi . 

Again in Fig. 2.4, we have that: 

— Gi~Go, G 2 ~G 0 , while the converse is not true. 

— Gi = Gi for i = 0 , 1 , 2, 3 , 4, while Gi ^ Gj for i ^ j. 

The three relations defined above will be used to define the semantics that 
we are going to study in the rest of the chapter: 

~ is used to build a semantics based on bisimulation; 

~ is used to build a semantics based on the concept of bisimulation that is 
also a function; 

= is used to build a semantics based on graph isomorphism ( injective em- 
beddings [81] or, equivalently, bisimulations that are bijections). 

Some basic properties of these relations are emphasized in the following 
lemma. 

Lemma 2.2.1. 1. and =, are reflexive and transitive relations; 

2. Both ~ and = are symmetric relations and thus, equivalence relations; 

3. ~ is a preorder (and not an ordering). 

I G = G' if and only if G~G' and G'~G. 
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Proof. Follows immediately from the definition for 1 and 2. For 3, consider 
the graphs: 




G i G 2 

It holds that Gi~G 2 and vice versa. However, the two graphs are not the 
same graph. For proving statement (4), it is sufficient to observe that, by 
the requirement (1) of the definition of bisimulation (Definition 2.2.1), the 
two functions ensuring that G'~G' and G'~G are onto. This means that 
| AT | = \N'\ and, thus, both functions are bijections. □ 

Remark 2.2.1. The semantics of the original definition of the language G- 
Log [81] is based on a different notion of matching of graphs: the so-called 
embedding. Relationships between our proposals and the embedding are ex- 
plained in Section 2.6. 



2.2.1 Semantics of Rules 

The first two notions that we define are the applicability of a rule and the 
satisfiability of a graph, given a rule. Note that in this chapter we deal with 
the fragment of G-Log without negation. 

Definition 2.2.3. Let G be a concrete graph and R a rule. For f in {~, ~, = 
}, R is ^-applicable in G if (3Gi C G)(R{ R g}l;Gi). 

Definition 2.2.4. Let G be a concrete graph and R a rule. For f in {~ 
,~,=}, G ^-satisfies R (G \=^ R) if for all Gi C G such that there is b\ 

bi 

with R{rs} £ Gi , there exist G 2 E G and b 2 D b\ that satisfy the following 
conditions: 

(») Gr C G 2 ; 

62 

(**) R{rs,gs} f G 2 ; 

(Hi) (VG 3 r G 2 )(G r C G 3 A R {rs} {G 3 - G 3 = Gi). 

Intuitively, G satisfies R if for any subgraph Gi of G matching (with 
respect to £) the red part of the rule (i.e. the pre-condition), there is a way 
to ‘complete’ Gi into a graph G 2 C G such that the whole rule R matches 
G 2 . Condition (in) is necessary to avoid the possibility of using other parts 
of G, matching with R{ R s} independently, to extend Gi. 
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Example 2.2.1. For instance, consider the graphs below. 



2 



(R) 



H H EH GH 



| P |s 



(Gi) 



(G 2 ) 



<g 3 ) 



Rule R is not ^-applicable to G\ (so G\ ^-satisfies R trivially) and G 2 
^-satisfies f?, for £ in =}. Observe the necessity of condition (in), 

in the case of to ensure that G 3 does not ^-satisfy R. As a matter of 
fact, with bi = {(2,5)} Rrs ~ G' where G' consists of the unique node 5. 

However, with b 2 = {(2, 4), (2, 5), (1, 3)} it holds that R ~ G 3 . To achieve this 
bisimulation, however, b± has been ‘unnaturally extended’. This is avoided 
by the last condition of Definition 2.2.4. 

Problems like those seen in Example 2.2.1 come from the fact that func- 
tions can be extended to relations. This is not possible for ~, and = because 
their extensions must be functions, by definition. Thus, Definition 2.2.4 can 
be significantly simplified for ~ and =: 



Lemma 2.2.2. Let G be a concrete graph and R a rule. For f in {~, =}, G 
^-satisfies R (G \=£ R) if and only if for all G\ QG and for all bi such that 

R{rs} £ Gi 3 G 2 C G and 3b 2 D b\: 



1. G\ C G 2 and 
i>2 

R{rs,gs} £ G 2 . 



□ 



The notion of applicability is a pre-condition for an effective application 
of a rule to a concrete graph, whose precise semantics is given below: 

Definition 2.2.5. Let R be a ride. Rs semantics |f?]^ C Q c x Q c is defined 
as follows: for £ in {~, =}, (G, G') £ [R]{ if and only if : 

1. GCG', G' \= ( R, and 

2. G' is minimal with respect to property (1), namely there is no graph G" 
such that GQG", G " C G', and G" hf R- 

Intuitively, a rule, if applicable, extends G in such a way that G satisfies R. 
Moreover, it is required that the extension is minimal. If R is not applicable 
in G, then G satisfies R trivially and (G, G) £ [A]?- 

Example 2.2.2. Consider the graph G and the rules R\ and R 2 of Fig. 2.5. 
The application of Rule R 2 leaves the graph G unchanged. The application 
of Rule R\ uniquely adds the grandfather relation. The application of Rule 
R 2 after that of Rule R\ further adds the grandchild relation. Thus, rule 
application is not commutative. 
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(G) (Hi) (H 2 ) 



Figure 2.5. Non commutativity of rule applications 

Example 2.2.3. Consider graphs of Fig. 2.6. It holds that (G, G\) and (G, G 2 ) 
belong to (G, G 3 ) ^ I-R]~ since G 3 ^ f?: this is due to condition 

(Hi) in the Definition 2.2.4. Notice that G\ ~ G 2 ~ G 3 . 

Definition 2.2.6. Given a G-Log graph G, then for f in {~, == M -MG) 

is a function from the set of the rules to the powerset of the set of G-Log 
graphs, defined as follows: 

lRh(G) =def {G':<G,G')eIi?M 




(G 2 ) (G 3 ) 

Figure 2.6. Application of a rule R to a graph G 

Rules can be combined to build programs according to Definition 2.1.7: 

Definition 2.2.7. Let S = {f?i, . . . , R n } be a set of rules. Then, for f in 
{~,~=},(G,G')elS}sif 
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1. G C G', G' |=£ i?j, for i = 1, . . . , n, and 

2. G' is minimal with respect to property (1). 

Let P be a program {S\, S n ) . For f in{~,~,=}, ( Go,G n } € [P]| 
if and only if there are G\, . . . , G n -i such that (Gi,Gi+ 1 ) € [5^+1 ]{, for 
i = 0 , . . . , n — 1 . 

The following notion is useful in practical querying: 

Definition 2.2.8. Let R be a rule and G be a concrete graph such that G |= 
R. For £ in {~, =}, the £-view of G using R, denoted by G\r is the union 

of all the graphs G' C G such that R f G' . The unfolded £-view of G using 
R (G®\r) is the disjoint union of all the G' . 







Mike Paul 



g\r 



l R 



Figure 2.7. Rules, concrete graphs, and views 



Fig. 2.7 shows a concrete graph G satisfying a rule R, its view using R (in 
this case there is no difference adopting different semantics) , and its unfolded 



view. 
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John Database Dr. Smith John Database Dr. Smith 



(G 0 ) (Gi) 

Figure 2.8. An update and a concrete graph 



2.2.2 Programming in G-Log 

Let us show now how to build up a database using the G-Log language and 
then, how to query it. 

Suppose we want to create a new database which contains informations 
about students , the courses they attend and teachers. We use an uncondi- 
tional rule, i.e. a rule without red part, since we wish to build up the database 
and not to modify an existing one. 

For example, the graph Go depicted in Fig. 2.8 is a G-Log generative 
unconditional query (its color is only green solid); it creates a simple database 
with three entities: John is a student who attends the Database course and 
Dr. Smith is the teacher of the same course. If we apply Go to the initial empty 
concrete graph, we build up the G-Log concrete graph Gi of Fig. 2.8 which 
satisfies Go- Given Gi we can either querying it or add more information 
to it. 




John Database Dr. Smith 




(R) (Gi) 

Figure 2.9. A G-Log rule and a concrete graph 



Now, suppose we apply the rule R of Fig. 2.9 ‘if a person teaches a course 
and a student attends that course then the person is a student’s teacher’ to 
Gi. R is £ -applicable to Gi for £ in {~, ~, =}. As a matter of fact, for each 
£ there is a £- relation between the red solid part of R and a subgraph of Gi; 
therefore, for each £, there is a way to expand the concrete graph in order 
to obtain a new graph G\ of Fig. 2.9 that matches the whole rule. In this 
particular case, the extension is the same. 
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This way, using G-Log rules, we can query a database to obtain informa- 
tion and complete its concrete graph adding new nodes or edges. 

Moreover, G-Log allows the expression of complex queries by means of 
programs which are sequences of rules. Sometimes it is worthwhile to have 
the possibility of expressing transitive properties: in G-Log a set of two rules 
is enough. 



2.3 Basic Semantic Results 

In this section we analyze the main results concerning the proposed paramet- 
ric semantics, in order to point out G-Log rules of a form ensuring desirable 
properties, first of all program determinism. 

2.3.1 Applicability 

Proposition 2.3.1. For each G-Log rule R, 

1. if R is =-applicable, then it is applicable ; 

2. if R is applicable , then it is ~- applicable. 

Proof. Immediate, by definition. □ 

Relations and = have different expressivity and thus they can be 

compared to form an ordering (cf. also, Section 2.6). The ordering is strict, 
as follows from Fig. 2.10: R\ is ^-applicable to G\ only when f is R 2 is 
^-applicable to G 2 for ~ and ~, but not for =. 




f g 2 



Figure 2.10. Applicability differences 



2.3.2 Satisfiability 

The situation is a bit more intricate as far as the concept of satisfiability is 
concerned: 
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Proposition 2.3.2. There are rules R such that the sets {G : G |=^ R}, 
{G : G |=^ i?}, and {G : G \== R} are pairwise distinct. 



Proof. Consider rule R\ and the graph G\ below. G\ does not satisfy R for 
~ and ~. However, since R\ is not =-applicable in G\, trivially G\ |== R\. 

On the other hand, we can note that rule R\ is =-applicable (hence, ~- 
and ^-applicable, thanks to Proposition 2.3.1) to graph G 2 . However, it only 
holds that G 2 \=~ R\ ■ 




(Ri) (Gi) 



S£ 



(G 2 ) 



Now we prove that for some R, {G : G \=^ R} may contain elements that 
are not in the other two sets. Consider rule R 2 and graph G 3 below: 




(R 2 ) <g 3 ) 



R 2 is =-applicable to G 3 but it is not =-satisfied. Similarly, R 2 is ~- 
applicable to G 3 but it is not ~-satisfied, due to the rightmost subgraph. 
Instead, G 3 ~-satisfies 1 ? 2 - □ 

Thus, none of the three sets is included in the other. 

Proposition 2.3.3. Let G be a G-Log graph and R a G-Log rule. For f in 
{~, ~,=}, if R is f-applicable to G, then there is a G' such that 1) G C G', 
and 2) G' R, and G' is minimal with respect to the properties (1) and 
( 2 ). 

Proof. Consider a rule R ^-applicable to G. The existence of a G' fulfilling 
(1) and (2) is clearly ensured: for each G; C G such that G^Rrs (existing 
by hypothesis), consider G' obtained by augmenting Gj with new nodes and 
edges ‘copying’ Rgs- Consider G' = Gil U iG[. The assumption (3) of the 
definition of rule (Definition 2.1.5) ensures that the process cannot enter into 
loop, thus ensuring the finiteness of the graph G' . 

To get one (among the various possible) minimal graph it is sufficient to 
remove some of the new edges and nodes (possibly collapsing them) while 
the satisfiability property still holds. □ 

In other words, if R is ^-applicable to G, then |l?]j(G) is not empty. 




















26 



2. Semantics Based on Bisimulation 



Corollary 2.3.1. For f =}, for any rule R and graph G, it holds 

that [i2] c (G) ^ 0. 

Proof. If R is not =-applicable in G, then [_R]j(G) = (G, G) by definition. 
Otherwise, the result follows from Proposition 2.3.3. □ 



2.3.3 Simple Edge-Adding Rules 

We analyze now the effect of some simple rules, edge-adding rules , and prove 
the determinism of their semantics. First of all we point out an ambiguity 
hidden in the graphical language. 

Consider the following rule in which the green part is composed only by 
one edge connecting two nodes with the same label: 




Its intuitive meaning is: any time you have two entity nodes labeled by A 
(necessarily distinct if we are using the = semantics) add an edge labeled p 
between them, unless one edge of this form is already present. The meaning 
is exactly the same as that of the following rule: 




The first rule, in a sense, hides a cycle of green edges. Notice that this happens 
even if the two rules are not bisimilar: the existence of a bisimulation between 
two rules is not required for the two rules to have the same expressive power. 

Table 2.1 shows the differences of the semantics of rules Ri 1 R 2 ,R?, ad- 
mitting cycles of green edges involving equivalent nodes on simple concrete 
graphs Gi,G 2 ,G 3 . 

We observe that: 

1. The three semantics are all equivalent with respect to rule R .\ ; 

2. For the other rules, there are always differences; 

3. Rules i ?2 and R% may be non- =- applicable for graphs with too few nodes. 
This is due to the constraint on cardinality required by the graph iso- 
morphism relation =; 
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Table 2.1. Effects of ‘cyclic’ rules 



4. The semantics based on bisimulation (~) does not distinguish the three 
rules Ri, i? 2 , and R$. This is due to the possibility given by bisimulation 
(a relation in general — not necessarily a function) to bind one node with 
a family of nodes. 

5. The semantics based on ~ can not distinguish rules R 2 and R 3 ; 

6. The semantics based on = distinguishes all the rules. 

7. In all the examples, the application of rules is a function. 
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Actually, the same conclusions can be drawn whenever all the nodes be- 
longing to a cycle are roots of ^-equivalent and disjoint G-Log graphs. R\ 
and i ?2 are: 

— ~-equi' valent if R\ ~ R 2 , 

— =-equi valent if R\ = R 2 , and 

— ^-equivalent if for all /, i?i~/ if and only if i? 2 ~A 

Let us study some more general properties of rule application for simple 
rules. We begin with the simple cases in which Rqs consists only of edges. 

Lemma 2.3.1. For each G-Log rule R, if Rqs consists only of one edge and 
no nodes, then [i?]j is a function, for f in 

Proof. We need to prove that for each G-Log graph G, there is exactly one 
G' such that (G, G') G [ /2] e . 

If R is not ^-applicable, then the result holds by definition choosing G' as 

G. 

Assume R is ^-applicable in G, for any £ in {~, ~, =}. By hypothesis, R 
can have only one of the following two forms: 




( 1 ) ( 2 ) 

where the nodes labeled by A and B in (1) are distinct nodes (but not nec- 
essarily label A is different from label B) or in (2) they are the same node. 
We prove first the fact when £ is ~. 

Let R be of the form (1). Its semantics is exactly that of introducing all 
possible edges labeled by p connecting subgraphs bisimilar to A , a and B, j3 
of G, unless they are already present. This kind of extension of G is clearly 
unique. 

Let R be of the form (2). As shown in Example 2.2.1, condition (in) in 
Definition 2.2.4 ensures that the only (minimal) way to generate a graph 
satisfying the rule is that of adding a self-loop for each node matching with 
A , a. 

When f is ~ or =, the situation is similar (and easier than for ~ as 
concerns case (2)). □ 

Now we extend the above lemma to the case in which R contains several 
edges. 

Proposition 2.3.4. For each edge-adding rule R, [A]j is a function, for £ 
in {~,~,=}. 
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Proof. R contains n green edges, with n > 0, by definition of rule. Consider 
the rules Ri, i = l,...,n, obtained by removing from R the green edges 
1, . . . , i — 1, i + 1, « . . , n. Each Ri is of the form analyzed by Lemma 2.3.1. 
Let G be a G-Log graph. Consider the following procedure, parametric with 
respect to £ in {~,~,=}: 

G" := G; 

repeat 

G' := G"; 

for * = 1 to n do 

let G" be the result of Ri applied to G" with respect to £ 

until G" = G'\ 

The procedure is clearly terminating. Moreover, by induction on the num- 
ber n of green edges, it is easy to prove that the procedure is ensured to return 
a unique graph G' (use Lemma 2.3.1), that G' ^-satisfies R and it is the min- 
imum graph extending G fulfilling such a property. □ 



2.3.4 Very Simple Queries 



We consider simple rules, actually very used in practice: 

Definition 2.3.1. A rule R is said to be a very simple rule if Rqs consists 
in one node /r and one edge (fxRab^u), with ic{y) = RS. 

In this section we are interested in very simple rules that are queries (very 
simple queries — VSQ). In general, [f?]j applied to a concrete graph G is not 
a function, even when R is a very simple query with respect to G. Consider 
the following diagrams: 6 




Then, |i?i]j(Gi) = {G'^G"}. However, the views of G[ and G" with 
respect to R\ (see Definition 2.2.8) are bisimilar. So we could guess that 
[ • ] j is a function modulo bisimulation, at least with respect to a ‘structured’ 
subgraphs of G, i.e. , graphs filtered by a rule. This does not hold in general, 
as follows from the following example concerning grandfathers, fathers, and 
sons: 

6 As usual, square nodes can be read as Result nodes and outgoing edges as con- 
nects edges. 
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It holds that [ I ?2 = {G' 2 , G 2 }. However, G 2 and G" are not bisim- 

ilar. The uniqueness (modulo bisimulation) is ensured only when the various 
parts of G matching with Rrs are all independent (unfolded views). 

However, a sort of regularity of the semantics of rule application can be 
obtained by considering 



Gf =def U G' 

G'£{R ]b (G) 

Such a graph is unique (up to isomorphism) and it is a sort of skeleton 
from which all elements of |i?]j(G) can be obtained. As an instance, for the 
examples above: 




(cf 1 ) (G* 2 ) 



Lemma 2.3.2. Let G be a G-Log graph and R be a VSQ with respect to G. 
Then , for !; in {~,~,=}, there is a unique graph (up to isomorphism) Gf 
such that: 

1. Gf b R, 

2. VG' G |i?b(G) it holds that G'QGf, and 

3. Gf is minimal with respect to properties (1) and (2). 

Proof. If R is not £- a PPli ca ble to G, then choose Gf as G. Otherwise since, 
by definition of query, no node labeled by result and edge labeled by connects 
is in G, whenever there is a subgraph G' of G such that Rrs^G 1 , add a node 
p and an edge (p, lab, v). Moreover, keep track of all nodes p, v of this kind. 
When all the G' of that form have been processed, add edges from all nodes 
p to all nodes v , uniquely obtaining the graph Gf . □ 
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Proposition 2.3.5. Let G be a G-Log graph and R be a VSQ with respect 
to G. For £ in {~,~,=}, define 

views (G, R , £) = {G ,a | fl . : 3 G' G [fl] c (G)} 

Then for each I\,l2 G views (G, R,f), it holds that I\ is isomorphic to 12 - 

Proof. Assume I\ , I2 G views (G, R, t;), I\ and I2 distinct graphs. This means 
that there are two graphs G\ and G2 in |i?]j(G) such that I\ = G“|^ and 
I2 = G“ |fi- Since Gi G |i?]^(G) it holds that for each G' V G such that 
RnsfG' there is an edge between a Result node (not occurring in G) and a 
node of G. This means, by definition of unfolded view, that a graph exactly 
composed by G' and the just mentioned node and edge is both in /1 and 
in I2 , and, moreover, this is an isolated subgraph of both I\ and I2. Nodes 
and edges are introduced in I\ and I2 only in this way. This ensures that 

h = h- □ 

To reach a more convincing deterministic result for the semantics, we 
suggest to add determinism to the definition: we define the deterministic 
semantics of R: 

[ R\f t {G) = G’ 

for G' G |i?]^(G) and G' contains at most one node more than G. 

Proposition 2.3.6. Let G be a G-Log graph and R be a VSQ with respect 
to G. Fort; in then |i?]^ et (G) is well-defined, i.e., [i?]| et (G) is 

a function. 

Proof. Assume, by contradiction, that G\ , G2 G [ R ] 5 (G) and that they differ 
by G for at most one node. 

If R is not ^-applicable, then Gi = G2 = G by definition. 

Assume R is ^-applicable. Since R is a query with respect to G, no result 
nodes are in G. Thus both G\ and G2 contains exactly one (result) node p 
more than G. Without loss of generality, we can assume that it is the same 
node in the two graphs. New (connects) edges have been introduced from p 
to the various subgraphs equivalent to Rrs- It is immediate to check that an 
edge of this form belongs to Gi if and only if it belongs to G2, unless one of 
them is not in |i?]j(G). □ 

[ i? ]^ e * (G) can therefore be seen as a privileged answer to a query. Actu- 
ally, it contains exactly all the information we need and does not introduce 
redundant nodes. 

We conclude this section with a consideration that explains the rationale 
behind the condition (3) of being a rule. 

Remark 2 . 3 . 1 . Consider the graph R in Fig. 2.11, that does not fulfill re- 
quirement (3) of being a rule. It intuitively says that for all nodes labeled A 
you need to have a node labeled by A connected with it by an edge labeled 
by p. The application of R to the trivial graph G generates a denumerable 
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family G" , G'" , ... of graphs satisfying R. However, none of them is minimal. 
Moreover, notice that the graph G' does not satisfy R, as condition (3) of 
the definition of bisimulation is not satisfied. 




Figure 2.11. Infinite generation 



2.4 Abstract Graphs and Semantics 

In order to represent sets of instances sharing the same structure, we intro- 
duce now the notion of abstract graph. Following the Abstract Interpretation 
approach [38, 51], we see that abstract graphs can be used as a domain to ab- 
stract the computation of G-Log programs over concrete graphs. In fact, Ab- 
stract Interpretation is a theory of semantics approximation which is mainly 
used for the construction of semantics-based program analysis algorithms, 
the comparison of formal semantics, the design of proof methods. As a first 
approximation, abstract interpretation can be understood as a non standard 
semantics, in which the domain of values (usually called concrete values) is 
replaced by a domain of descriptions of values (called abstract values) , and in 
which the operators are given a corresponding non standard interpretation. 

Moreover, our approach to obtain abstract graphs can also be seen as an 
alternative view of reasoning on schemata and instances of a database or a 
WWW site, coherently with the Dataguide approach of [54]. We recall that 
operating on schemata may be useful whenever the user wants to investigate 
on general properties on the structure of a given semistructured database. 

Definition 2.4.1. A (G-Log) abstract graph is a G-Log graph such that: 

1. (Vx € N U E)(£c(x) = black), 

2. (Vx £ N)(£s(x) = _L), i.e., an abstract graph has no values. 

With Q a we denote the set of G-Log abstract graphs. 
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String 



“jString 

NY 



j String J 
Andy 



“^String 

Bob 



(S) (. 1 ) 

Figure 2.12. A schema, an instance, and a rule 



I Person 



Money 



(«') 



Let us use once more the notion of bisimulation to re-formulate the G-Log 
concepts of instance and schema. Intuitively, an abstract graph represents a 
concrete graph if it contains its skeleton while disregarding multiplicities and 
values. 

Definition 2.4.2. A concrete graph I is an instance of an abstract graph G 
if (3 1' □ I){G ~ I') . In this case G is said to be a schema for I. I' is said 
to be a witness of the relation schema-instance. 

In Fig. 2.12 there is an example of application of the definition above. (S) 
represents (/). To build the witness (/'), add to (/) an edge labeled by works 
linking the entity node Person of Bob with the entity node Town. Moreover, 
add edges labeled by lives from the two nodes labeled Person to the node 
labeled Town, and add also an edge reverse to the father edge. It is easy to 
check that a bisimulation from S to I' is uniquely determined. 

The notions of applicability and satisfiability for abstract graphs are the 
same as in Definitions 2.2.4 and 2.2.2. This also holds for the semantics defi- 
nitions for rules and programs. Anyway, the semantics based on bisimulation 
~ is, in a sense, the less precise and, thus, it is the most suited for abstract 
computations. 

The following properties can be immediately derived from the definitions 
above. 

Lemma 2.4.1. 

(a) If I is an instance of G with witness V , then for all I" such that 
I C I" C I' it holds that I" is an instance of G. 

(b) If I is a concrete graph, G is an abstract graph, with I ~ G, then I is 
an instance of G. 

(c) If I is a concrete graph, G,G' are abstract graphs, with G ~ G' , then 
I is an instance of G if and only if I is an instance of G' . 
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We have already observed after Definition 2.1.2 that given a G-Log graph 
Go, the set ({G is a G-Log graph : G C Go},E) is a complete lattice. 7 In 
the rest of this section we assume that every (concrete and abstract) graph 
belongs to this lattice. Under this hypothesis we may properly deal with the 
C relation between graphs. 

A Galois connection ([38]), which is used in our context to approximate 
in the best way any concrete graph by an abstract one, between Q ^ and 
p(G C ) can be obtained by considering the concretization function 7 : — > 

p(G c ) : 

7(G) = {I ■ I is an instance of G} 
and its adjoint abstraction function a : p(G c ) — > G'fh, defined by 

a(S) = U{Ge^ : 7(G) C S}. 

Algorithmically, given a set of instances S, we may build up its abstraction 
a(S) by taking the union of all their nodes and edges. Moreover, applying 
standard techniques, we can build the minimum graph ~-equivalent to that 
graph. This graph is unique up to isomorphism, and it can be computed 
without knowing Go- Using the techniques in [79], this simplification can be 
performed in time 0(m log n + n), where m is the number of edges and n the 
number of nodes. 

The abstraction function for rules can be obtained exactly in the same 
way as for concrete graphs. Thus, when R is a rule, a(R) denotes the graph 
obtained by deleting values (i.e., is = _L) from R and then computing the 
minimum graph ~-equi valent to it. 

The following two auxiliary results will be useful in order to prove mono- 
tonicity and injectivity of the function 7 just defined. 

Lemma 2.4.2. If Gx = {Nx,Ex,h) E G 2 = {N 2 ,E 2 ,£ 2 ) and G, - 6" = 
( N',E',t '), then there is G" □ G' such that G" ~ G 2 . 

Proof. Without loss of generality, assume that N' fl N 2 = 0. Let b' such that 
Gi ~ G'. Let N" = N', E" = E' , I" = b" = b' . 

— for all n £ N 2 \ Ni let N" = N" U {?r}, b" = b" U {(n,n)}, and I" = 
i" U {(n,e 2 (n))y, 

— for all (m, A, n) in E 2 \ Ex, let E" = E" U {(/x, A, v) : mb" p, nb"u}. 

It is immediate to check that G 2 ~ G" = (N " , E " , t") . □ 

' Assume, for instance, to deal with Web sites. Go can be chosen as the graph 
obtained by disjoint union of all the instances and schemata of all Web sites. 

8 Given two posets (A, < A ) and (C,<°), a Galois connection is a pair of maps 
(a, 7) such that: a : A 1 — > C, y : C i—> A and Va £ A, Vc G C : a(a ) < c c <£=> a < A 
7(c). 
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Lemma 2.4.3. If G C G\ ~ Gi C G3 ~ G, i/ien G ~ Gi ~ G2 ~ G3. 

Proof. It is sufficient to prove that G ~ Gi; the remaining part of the claim 
follows by the fact that ~ is an equivalence relation. Let a and b be the two 
bisimulations such that Gi ~ G2 and G3 ~ G. By Lemma 2 . 4 . 2 , there is Go 

such that Gi C Go and Go ~ G2, where a ' extends a as in the proof of that 
lemma. We will refer to a 1 simply as a and we call c = a o b. It is easy to 
verify that c is monotonic; thus we can build two infinite descending chains: 

G\ □ c(Gr)p G) □ c(c(Gr)) □ c(c(c(G!))) . . . 

Go □ c(G 0 )(= G) □ c(c(G 0 )) □ c(c(c(G 0 ))) . . . 

Since the graphs are finite, there must be an integer n such that 
c n Gi = c" +1 Gi A c n G 0 = c n+1 G 0 

(assuming that the graph G is nonempty, the fixed points above must be non 
empty). Moreover, by construction, it holds that: 

G 1 ~ c(Gr) ~ c(c(G!)) - c(c(c(Gr))) . . . 

Go - c(G 0 )(= G) ~ c(c(G 0 )) ~ c(c(c(G 0 ))) . . . 

In particular, Gi ~ c n (Gi) and G ~ c n (G 0). Since the application of relation 
c is monotonic, it holds that 



c*(Gr) C c l (G 0 ), 

Thus, in particular, c"(Gi) C c n (Go). On the other hand, since c(Go) = G 
and G G Gi, it holds that 



c i+1 (G 0 ) C c*(Gr) . 

Thus, c n (G 0) = c” +1 (Go) C c"(Gi). This means that c n (G 1) = c n (Go) and, 
moreover, that G ~ c"(Gi) ~ Gi. □ 

Theorem 2.4.1. Function 7 is monotonic, i.e. for any pair of abstract 
graphs G, G' , GCG' implies 7 (G) C 7 (G'). 

Proof. Let / € 7(G). By definition of 7, there exists /' □ / such that /' ~ 
G C G'. By applying Lemma 2 . 4 . 2 , there exists I" □ I' such that /" ~ G'. 
By transitivity of the ordering relation, we get I" □ /. Hence, I G j(G'). □ 

Theorem 2.4.2. Function 7 is injective, i.e. for any pair of abstract graphs 
G, G', G^G' implies 7 (G) ± 7 (G'). 

Proof. Assume that 7 (Gi) = 7(G2), and let I\ G 7(Gi) with Ji ~ Gi. By 
the assumption, I\ G 7(G2) too. Hence, by the definition of 7, 3 I[ □ I\ such 
that I' ~ Gi. Therefore, 
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G 2 . 

Now, let I 2 G r ){G 2 ) with I 2 ~ G 2 . By the same reasoning, there exists I' 2 3 I 2 
such that I 2 ~ G\. Therefore 

Gi ~ h E I[ ~ G 2 ~ h E /a ~ Gi- 

By lemma 2.4.3 we immediately get G\ ~ G 2 , concluding the proof. 

□ 

Theorem 2.4.3. (Correctness) Let G,G' be abstract graphs and R a rule 
such that ( G,G' ) € |a(i?)]^. If I £ 7 (G) and {1,1') G i/jen I' G 

7 (G'), i.e., the following diagram commutes: 

Q a (R) £ 7 / 

I 7 I 7 

I 4 /' 

Proof. By the hypotheses and by Lemma 2.4.2, there exist I and /' such that 
the following diagram holds: 



R 


H 


r 


I' - 


- G' 


b 


a(R) 






u 


u 


u 






R \{RS} 


c 


/ C I r 


- G 


□ 


®{R)\ {RS} 



By the definition of satisfiability of the rule i?, we may build I" such that 
/' C I" ~ /'. In order to build such a extend /' only with arcs and 
nodes belonging to I': minimality conditions on G' (and thus on I') avoid 
redundancies. Hence, from the diagram above we get I' C I" ~ I' ~ G', i.e. 
/' G 7 (G'). 

□ 

Theorem 2.4.3 guarantees the correctness of abstract computations: the ap- 
plication of a rule abstraction to an abstract graph safely represents the 
application of the corresponding concrete rule to any of its instances. The 
practical impact of this result is quite interesting. Consider the abstract graph 
S and the rule R' in Fig. 2.12. Since a(R') is not applicable to S, we can 
immediately conclude that the same rule is not applicable to any instance of 
S. Therefore, we may apply rules to abstract graphs in order to build com- 
plex queries, and then, once checked that they are applicable to the abstract 
graph we can turn to the concrete cases to get the desired answer. This is 
particularly interesting when the instance resides on a remote site. 

Moreover, suppose we use G-Log rules to specify site instance evolution 
during the site life. Then, the application of the same rule to the site schema 
returns automatically the schema corresponding to the new site instance. 

9 Of course, in this context we are interested in those site updates that would 
affect the schema, since schema- invariant updates do not need to be traced. 
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Remark 2.^.1. According to standard definitions of schema, the concept in- 
troduced in Definition 2.4.1 may be further enforced with the condition: 

3. (\/x, y £ N){(.t{x) ^ £r(t/) V lc(x) ^ £c(y)), be. there is no repetition of 
nodes. 

In this case, given a set S of instances, we may build up its abstraction a'(S) 
computing a(S) and then by collapsing all the nodes in it having the same 
type and label. The same technique can be applied to a rule R. a ' can be 
seen as an abstraction less precise than a; by construction, it holds that 
a , (a(S')) = at'(S). The concretization function 7 remains the same. 

When R is a query, Theorem 2.4.3 still holds using a! in place of a. In 
Fig. 2.13 we present the concrete graph I, the abstract graph G = a(I) 
and the schema S = a'(I). For each rule Ri , a(Ri) = a'(Ri) is obtained 
by removing the concrete value label (in these cases, Udine and Verona). It 
holds that: 

-{/'} = 1^1(7), { 1 } = |i? 2 ](/) = [# 3 ICO; 

- { G '} = [a(f?i)](G) = [a(ifc)](G), {G} = [a(i? 3 ) ]G; 

- {S'} = Wi^mS) = [a'( J R 2 )](S) = [a'(i? 3 )](S). 

When, as in this case, rule application is deterministic, the two levels of 
abstractions can be summarized by the following diagram: 



s 


oRR) 


Sf 

Ui 


t«' 




a'(G f ) 

Ta' 


G 


a(R) 


Gf 

Ui 


T a 




«(ff) 

{ a 


I 


_R 


If 



This leads to a hierarchy of abstraction in the spirit of [51]. 



2.5 Logical Semantics of G-Log 

Aim of this section is to provide a model theoretic characterization of the 
language G-Log. First, we show how to automatically extract a first-order 
formula from a G-Log graph. Then we show that G-Log concrete graphs 
are simply representations of Herbrand structures: they are models of the 
formulae associated with the rules they satisfy. 

As said in Section 2.1.2, result nodes and their outgoing edges la- 
beled by connects are represented without writing explicitly the labels. 
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Grado Udine 
I 




Grado Udine 







Udine 






Verona 




Verona 



(Ri) (R2) 

Figure 2.13. Two levels of abstraction 



(*3) 



We write <j)(xi , . . . , x n ) to denote that <f> is a first-order formula with free 
variables among x±, . . . , x n . Moreover, [£j^{n)\{x\, . . . , x n ) denotes the atom 
p(x 1, . . . , x n ) where p is Similarly for £c and 

2.5.1 Formulae for G-Log Rules 

In this subsection we describe how to obtain a first-order formula from each 
G-Log rule without negation either in its nodes or in its edges. 

Definition 2.5.1. A G-Log formula is a closed first-order formida of the 
following form: 
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Vxi • • • x h (-Bi(xi, . . . , x h ) -> 3^i • • • z k B 2 (xi, . . . , x h , zi, , . . , z*)) 

where Xi,Zi are variables and the Bi are conjunctions of atoms. 

Remark 2.5.1. Observe that the existential quantification of the variable on 
the r.lr.s. of the implication causes that the formula can not be encoded as a 
simple Horn clause of DATALOG. 

We show how to associate a G-Log formula to every solid G-Log rule. We 
begin with the semantics based on directional bisimulation then we show 
how to modify the technique according to the other two semantics. 

Definition 2.5.2. Let R = ( N,E,£ ) be a G-Log rule; we define the G-Log 
formida < P r fi and the formula tUff as follows: 

1. Vn £ N associate a distinct variable u(n). 

2. Vn € N let ip n be the formula: 

[tc{n)\(v(n)) A \(.r{n)\{v{n)) A [ e s (n)\(v(n )). 

If I s(n) = _L, then the last conjunct is omitted. 

3. Ve = (m, (c,£c(e)),n) € E let (f e be the formula: 

[£c(e)\(v(m),v(n)). 

4- Let n\, . . . ,rih be the nodes of N such that f'c(e) = RS, 

5. Let n[, ... , n' k be the nodes of N such that t'c(e) = GS, 

6. The formula is: 



Vn(ni) • • • u{n h ) 



A A A 

\neN,e c {n)=RS e£E,e c (e)=RS / 

/ A 

MO' ■■»{<) ^cin^GS 

A v* 

\ e6 E,e c (e)=GS 




1. The formula Eff is: 



3 n(m) • • • v{n. h ) A A 

\n£N,l c (n)=RS e£E,l c {e)=RS J 

For instance, the formula associated to the rule R depicted in Fig- 
ure 2.14 is: 10 

y f Person(xi) A Town(x 2 ) A Lives(xi, X 2 ) A Town(x 3 )A \ 

1 2 3 y Studies(xi, X3) — > 3z\ (result ( 21 ) A connects (zi, xi)) J 



10 



We omit the type information for the sake of readability. 
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(R) 



Figure 2.14. G-Log rule 



Logical formulae corresponding to G-Log graphs are different if we study 
the other two semantics. With the semantics based on the concept of graph 
isomorphism =, rule R of Fig. 2.14 represents the query ‘collect all the people 
living and studying in two different towns’. 

In the construction of we need to force the fact that the two towns 
must be distinct. This can be done by adding an inequality constraint between 
the variables identifying the nodes X 2 and x 3 : 



( Person(xi) A Town^) A Lives(a;i, X 2 ) A Town(£ 3 )A \ 
Studies(a;i, xff) A X 2 yf £3 — * I 

3zi (result ( 24 ) A connects(zi, X\)) J 

More generally, we will require that all nodes of the graph are distinct: 

Definition 2.5.3. Given a G-Log rule R = ( N,E,£ ), and the formula 

= Vt'(ni) • • • v{n h ) {Bi — > 3i ffn[) ■ ■ ■ v(n' k )B 2 ) , 

then the formula ‘Ljj is: 



Vz'(ni) • • • v(n h ) 



B\ A j\ v(m) yf v(rij) 

( B 2 A f\ v{n[) yf v(p!j) \ 

3u(n' 1 ) ■ ■ ■ v(n' k ) 



V 



l<i<j<k 

A f\ v(m) yf 



\ 



) 



Similarly, formula can he obtained by adding Ai <i<j<h u ( n i) yf ^(ny) to 
the conjunct s of L'ff . 

Consider now the semantics based on the rule R, and the concrete 
graph G of Fig. 2.15. R is ^-applicable (Definition 2.2.3) to (G) but is not 
^-applicable for £ in {=, ~}. 
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(B) (G) 



Figure 2.15. A G-Log rule and a concrete graph 



The G-Log formula <P ^ : 

_ f Person(xi) A Student (£2) A Course^) A Teaches(xi, £3) \ 
x 1X2 X3 ^ a Attends (X2, X3) — > Teacher(xi, X2) ) 

represents the query ‘if a person teaches a course and a student attends the 
same course, then the person is that student’s teacher’. The semantics based 
on bisimulation requires a weaker condition, since the constraint ‘the same’ 
cannot be forced. This means that, in the ^-semantics the rule requires that 
whenever a student attends a course and a person teaches some (other?) 
course, the person is that student’s teacher. 



Definition 2.5.4. Given a G-Log rule R, we define the unfolding of R, 
briefly unf(R), to be the graph obtained by replacing every subgraph of R 
of the form 



with the subgraph: 




Then we set 4>£ = and = Knf(R)- 

For instance, the formula for the rule R of Fig. 2.15 is: 



\/x 1 X 2 X 3 X 4 



Personal) A Student^) A Course(£3)A 
Course^) A Teaches(xi, X3) A Attends(x2, £4) 
Teacher(xi, X2) 



that does not constraint the courses to be the same. 
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In Section 2.5.3 we formally prove that the logical semantics of rules we 
are describing is consistent with the semantics defined in Section 2.2. 

2.5.2 Concrete Graphs as Models 

In this subsection we show, independently of the operational rule, how to 
obtain a first-order formula, from a concrete graph; in particular, we show 
that concrete graphs are representations of the least Herbrand model (modulo 
isomorphism) of the Skolemization of that formula. 

Definition 2.5.5. Let G = ( N,E ,£ ) be a concrete graph, As in Defini- 
tion 2.5.2, to every n £ N = {m, . . . , nh} associate a variable i fin) and 
to every node n and edge e associate the formidae (f n and Lp e , respectively. 
Then, the formula T>g associated to G is: 

3v(ni) ■ ■ ■ u(nh) /\ <p(n) A A <^ e ) 

neN e£E 

Definition 2.5.6. Let G = ( N,E ,£ ) be a concrete graph. We associate to G 
the structure Mq = ( D , I) built as follows: 

1. for every node n £ N introduce a constant c n ; let D be the set {c n : n € 
N}. 

2. I(p(c n )) = true if and only if p(v{n)) is a conjunct of^c- 

3. I(p(c m ,c n )) = true if and only if p(u(m),u(n)) is a conjunct of^c- 

Mq is the least Herbrand model of the Skolemization of the formula <Pg- 
For example, let G be the concrete graph of Fig. 2.15, and Cj, with i = 
1, . . . , n, the constants introduced for the nodes of the concrete graph. Then 
Mq can be expressed by the set of facts that are true: 

Person(ci), entity (ci), Course^), entity(c 2 ) 

String(c 3 ) , slot ( 03 ) , Physics(c 3 ) , Student (cfi) , entity ( 04 ) , 
Course(c 5 ),entity(c 5 ) String^), slot(ce), Data Base(ce), 

Teaches(ci, C 2 ), Name(c 2 , C 3 ), Attends(c 4 , C 5 ), Name(c 5 , cq) 

2.5.3 Model Theoretic Semantics 

In this subsection we highlight the relationships between the semantics of Sec- 
tion 2.2 and the logical view of G-Log graphs presented in Subsections 2.5.1 
and 2.5.2. 

Proposition 2.5.1 (Applicability). Let G be a concrete graph and R a 
rule. Then R is f -applicable to G if and only if Mq |= 
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Proof. We prove first the claim when £ is The fact that R is ~-applicable 
to G means that there is Gi C G such that Rrs^Gi. Thus, there is a 
function / from the nodes of Rrs to those of Gi fulfilling the requirements 
of Using that / we find exactly the constants c t of the domain D obtained 
by skolemization of <Pg to be assigned to the existentially quantified variables 
v(nf) of SPff to ensure that Mq b ■ Similarly, starting from an assignment 
ensuring Mq |= we can build a function / such that Rrs^G i for some 
Gi C G. 

To conclude the proof, notice that when computing 4 and 4 we have 
kept into account the constraint to map distinct nodes into distinct objects, 
and the possibility given by the unfolding of a node to be mapped into distinct 
objects, respectively. □ 

Proposition 2.5.2 (Satisfiability). Let G be a concrete graph and R a 
rule. Then G f -satisfies R if and only if Mq f= 4- 

Proof. We prove first the claim when £ is ~. G ^-satisfies G when for all 
Gi CG such that Rrs^G i there is a G 2 3 Gi such that f?~G 2 - But this is 
exactly the meaning of the formula <P R . 

To conclude the proof, notice that the way to compute 4 when f is ~ 
or = ensures that the result holds. □ 

Proposition 2.5.3 (Rule application). Let G be a G-Log concrete graph 
and R a rule. Then, for £ € {~ ~, =}, 

G' e [i?]j(G) - (M G ^ <4 A G = G')v 

(Mg b 4 A Mg b 4 A G = G')V 

(Mg b 4 a m g b 4 AG'/GA Mg' b $r) 

Proof. The proof is by case analysis. Assume G' € |f?]^(G). 

1. If R is not ^-applicable to G then G' = G by definition. From Proposi- 
tion 2.5.1 it holds that Mq b 4- Since the l.lr.s. of the implication of 
4 is false, then trivially Mq b 4- 

2. If, conversely, R is ^-applicable to G (and from Proposition 2.5.1 Mq b 
4) either: 

a) G ^-satisfies R (and thus, by Proposition 2.5.2, Mq b 4)> or 

b) G does not ^-satisfy R (and thus, by Proposition 2.5.2, Mq b 4)- 
In the former case G' — G; in the latter there is a G' □ G such that G' 
^-satisfies R. Thus, by Proposition 2.5.2, Mq' b 4- 

□ 

Notice that it can be the case that 

Mg b 4 a Mg b 4 A G' b G A Me b ^ R 

but G' ^ |i?]|(G). This happens when G' is not a minimal extension of 
G. Thus, the converse direction of tlr above proposition is not always true. 
However: 
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Corollary 2.5.1. Let G be a G-Log concrete graph and R a rule. Then , for 

$ € ~,=}, 



3G'(G' € [i?] 5 (G) AG' ^G) <-► 



Mq b ^ A Mq b ^f?A ^ 
(3 G'UG)(M G '\=$r) ) 



Proof. The (— >) direction follows immediately from Proposition 2.5.3. As- 
sume now that Mq \= ^ Mg b A (3G' □ G)(Mc \= &r). From 

Propositions 2.5.1 and 2.5.2 we know that R is ^-applicable to G and G does 
not ^-satisfy R- Then, by Proposition 2.3.3 this is sufficient to ensure that 
there is a minimal G' extending G and ^-satisfying R. □ 

To sum up, given a rule R and a graph G, the model-theoretic interpre- 
tation of the rule application is that of finding a (minimal) G' □ G such that 

Mg 1 b ®r- 

More generally, the effect of the application of the consecutive rules 
R \ , . . . , R n to an initial concrete graph G is that of producing a (non- 
deterministic) path of the form: 



G ^ Gi §$■ • • • 2? G n 



where Mc t \= for all j <i. 

As a final remark, also for abstract graphs it is possible to develop a logical 
semantics. However, while a concrete graph leads to an existential formula, 
an abstract graph leads to universally quantified formulae in which a lot of 
closure properties are to be stored. The difficulty of handling those formulae 
requires further work. 



2.6 Relationship with the Original G-Log Semantics 

In this subsection we wish to point out the connections of a bisimulation- 
based semantics with the embedding-based semantics of [83, 81]. To complete 

b 

the Definition 2.2.2, b is a directional pseudo-bisimidation , denoted by Go 
Gi , if there is a function b from the nodes of Go to the nodes of G i fulfilling 

conditions 1 and 2 of the definition of bisimulation and, moreover, condition 

b 

3 for i = 0. We say that Go«C^Gi if there is a b such that Go Gi. is 
used to build a semantics based on the notion of embedding as given in [81]. 

It is immediate to extend the Proposition 2.3.1 proving that if R is ~- 
applicable, then it is <C^- applicable. However, also this implication is strict: 
in Fig. 2.16 it is represented a rule R applicable to a graph G only by this 
semantics. 
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Thus, the naturally induced ordering among relations is depicted by: 



I 



/ \ 

~ 

\ / 

J 

The bottom element, say J of the above graph exists: GJ G' if there 

is a relation (not necessarily a function!) b fulfilling conditions 1 and 2 of 
the definition of bisimulation and, moreover, condition 3 for i = 0. The first 
Example of Fig. 2.10 denotes a case of applicability for ~ but not for <C^. 




Figure 2.16. R is <?G -applicable to G 



2.7 G-Log Graphs with Negation 

The semantics introduced in this chapter can be extended in order to deal 
with rules and programs with negation (i.e. , containing red dashed nodes and 
edges — cf. Section 2.1.1). 

Intuitively, dashed edges express negative information; consider, for in- 
stance, R and G as in Fig. 2.17. R intuitively means ‘collect all the people 
that do not live in a town named Verona’. The fact that graphs G’ and G" 
satisfy R can be formalized by extending the definitions of [81] concerning 
negation according to our semantics. However, 

— G' can be obtained from G by using a sort of failure rule or, almost 
equivalently, by applying the Closed World Assumption: we infer that ‘Mike 
does not live in Verona’ from the fact that we cannot derive that this fact 
is true. 

— G" is obtained by adding the hypothesis ‘Mike lives in Verona’ that ensures 
that no subset of G fulfills R. 
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This kind of non-determinism is dealt with in [81] and [83] by limiting 
the G-Log programs to queries. 




(R) (G) (G') (G") 

Figure 2.17. Negation as failure 



2.8 Computational Issues 

During the implementation of the semantics of the language G-Log, two typ- 
ical graph problems must be faced: given two graphs G\ and G 2 , 

1 . to verify if Gi£G 2 , and 

2 . to verify if there is G 3 C Gi such that G^G 2 - 

In the case of the relation =, problem 1 is known as graph isomorphism, 
while problem 2 is the subgraph isomorphism. The former is in NP but still 
it is not known whether it is iVP-complete or it is in P. 11 The latter in 
iVP-complete [53]. 

In the case of bisimulation (~), in general problem 1 is polynomial 
( 0 (m log n + n), where m is the number of edges and n the number of nodes.; 
see, e.g. [79]). In particular cases this problem has a linear solution [45]. 
However, the second one is again NP-complete [44] . 

In Chapter 3 we will show that for some kinds of graphs we can improve 
the performances of the old implementation of G-Log based on the notion 
of embedding. In fact, model-checking allows to apply a polynomial time 
procedure to solve a subset of graphical queries on semistructured databases. 



2.9 Other Languages for Semistructured Data 

In this Section we briefly describe two well-known languages for semistruc- 
tured databases that are respectively based on SQL and graphical queries. 

11 Actually, it is one of the candidates for membership in the (hypothetical) inter- 
mediate class NPI [53]. 
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2.9.1 UnQL 

UnQL is a language for querying data organized as a rooted directed graph 
with labeled edges [12]. According to the definition in [12], a rooted labeled 
graph is a tuple G = (N,E,s,t,is,£), where ( N,E,s,t ) is a multi-graph, 
s,t : E — > N denote the source aud the target of each edge, respectively; 
is £ N is the root of the graph, and l : E — > Label U{_L} is a labeling function 
for edges. 



1 






Figure 2.18. An UnQL graph, called DB 



Some simple UnQL queries on the DB database of Fig. 2.18, which rep- 
resents with a rooted labeled graph a relational database, are the following. 
The expression 



select t where R1 => \t DB (2-1) 

computes the union of all trees t such that DB contains an edge Rl => t 
emanating from the root. There is only one such edge in DB and therefore, 
this query simply returns the set of tuples in Rl. The returned expression is: 
{ Tup {A =>■ 1, B =>■ 2, C => 3}}. 

The expression 



select t where \l => \t <— DB (2-2) 

uses a label variable \l to match any edge emanating from the root. The 
result is the union of all tuples in both relations. 

The UnQL query 

select {Tup => { A => x, D =>• z}} 

where Rl => Tup => {A => \x, C => \y} <— DB, (2.3) 

R2 => Tup => {C =>\y,D=> \z} DB 

computes the join of Rl and R2 on their common attribute C and the pro- 
jection onto A and D. It is easy to check that its application to the database 
DB has an empty result. 
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From the previous examples we note that more expressive power is needed 
in order to look for data whose depth in the graph is not predetermined by 
any schema. A simple query that looks arbitrarily deep into the database to 
find all edges with a numeric label is the following: 

select {/} where _* =4- \l =>■ _ <— DB, isnumber(l ) (2-4) 

Here the _* is a “repeated wildcard” that matches any path in the DB tree. 

This brief introduction about the main features of UnQL will be used in 
Chapter 3 to show the applicability of the model-checking based approach to 
the data retrieval activity for SQL-like query languages. 



2.9.2 GraphLog 

GraphLog is a query language based on a graph representation of both data 
and queries [33]. Query graphs represent patterns, corresponding to paths 
in databases, that must be present or absent in database graphs, moreover 
they define a set of new edges (i.e. new relations) that are added to the 
database graphs whenever the path is found. GraphLog queries are sets of 
query graphs. 

For example, the graph I in Fig. 2.19 is a representation of a flights 
schedule database. Each flight number is related to the cities it connects (by 
the predicates from and to, respectively) and to the departure and arrival 
time (by the predicates departure and arrival). 

Formally, GraphLog represents databases by means of directed labeled 
multigraphs. According to the definition in [33] a directed labeled multigraph 
G is a tuple {N,E,Ln,Le,l,v,c) where iV is a finite set of nodes, E is a 
finite set of edges, L n is a set of node labels, Le is a set of edge labels, t, the 
incidence function, is a function from E to N 2 that associates with each edge 
in E a pair of nodes from N, v, the node labeling function, is a function from 
N to Ln that associates with each node in N a label from Ln, and finally e, 
the edge labeling function, is a function from E to Le that associates with 
each edge in E a label from Le- 




(I) (G) 

Figure 2.19. GraphLog representation of a database and a query 
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Moreover, in GraphLog a graphical query is a set of query graphs, each 
of them defining (constructive part) a set of new edges (relations) that are 
added to the graph whenever the path (non-constructive part) specified in 
the query itself is found (or not found for negative requests). More in detail, 
a query graph [33] is a directed labeled multigraph with distinguished edges 
(they define new relations that will hold, after the query application, between 
two objects in a chosen database whenever they fulfill the requirements of the 
query graph), without isolated nodes, and having the following properties: 

1. the nodes are labeled by sequences of variables; 

2. each edge is labeled by a literal (i.e. positive or negative occurrence of 
a predicate applied to a sequence of variables and constants) or by a 
closure literal , which is simply a literal followed by the positive closure 
operator; 

3. the distinguished edge can only be labeled by a positive non-closure lit- 
eral. 




Figure 2.20. The descendants of PI which are not descendants of P2 



For example, graph (G) in Fig. 2.19 requires to find in a database two 
flights FI and F 2 departing from and arriving to the same city G, respec- 
tively. The edge ‘result’ will connect FI and F 2. 

Fig. 2.20 shows another query graph. There are several different kinds of 
literals labeling the edges. The literal “not-desc-of(P2)” is the (necessarily 
positive) literal labeling the distinguished edge. The literal “descendant-1-” 
labeling the edge between FI and P 3 is a positive closure literal. The literal 
‘Gdescendant+” labeling the edge between P 2 and F3 is a negative closure 
literal (the negation is represented by crossing over the edge with that lit- 
eral) . Intuitively, this query graph expresses the query that returns a ternary 
predicate “not-desc-of(Pl,P3,P2)” with the descendants P 3 of person PI 
who are not descendants of person P 2. 
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This short introduction to the language GraphLog will be used in Chap- 
ter 3 to show the main limits of the approach based on model-checking tech- 
niques for solving graphical queries on semistructured data. 




3. Model-Checking Based Data Retrieval 



As we said in the previous chapters, semistructured data are often represented 
by using data models based on directed labeled graphs [13, 84, 18]; users can 
retrieve information of interest with graph-based queries [50, 33, 81] or with 
extended SQL languages [12, 4, 16]. In both cases, the data retrieval activity 
requires the development of graph algorithms. In fact, queries (graphical or 
not) are expected to extract information stored in labeled graphs. In order 
to do that, it is required to perform a kind of matching of the “query graph” 
with the “database instance graph”. More in detail, we need to find sub- 
graphs of the instance of the database that are somehow similar to the query 
graph. In Chapter 2 we have shown that even if the problem of establishing 
whether two graphs are bisimilar or not is polynomial time [60, 79], the task 
of finding subgraphs isomorphic or bisimilar is NP-complete [44] and hence, 
not applicable to real-life size problems. 

Graphical queries can be easily translated into logic formulae. Techniques 
for translating graphs in formulae have been exploited in literature [14] . The 
novel ideas of this chapter are to associate a modal logic formula T to a 
graphical query, and to interpret database instance graphs as Kripke Tran- 
sition Systems (KTS). We use a modal logic with the same syntax as the 
temporal logic CTL ; the notion of different instants of time represents the 
number of links the user needs to follow to reach the information of interest. 
In this way, finding subgraphs of the database instance graph that match 
the query can be performed by finding nodes of the KTS derived from the 
database instance graph that satisfy the formula Tf . This is an instance of the 
model- checking problem, and it is well-known that if the formula T belongs 
to the class CTL of formulae, then the problem is decidable and algorithms 
running in linear time on both the sizes of the KTS and the formula can be 
employed [27]. 

In particular, we identify a family of graph-based queries that represent 
CTL formulae. This is very natural and, as immediate consequence, an ef- 
fective procedure for efficiently querying semistructured databases can be 
directly implemented on a model checker. We use a “toy” query language 
called W. It can be considered as a representative of several approaches 
in which queries are graphical or can easily be seen as graphical (cf. e.g. 
Lorel [4], G-Log [81], GraphLog [33], and UnQL [50]). We will relate W to 
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UnQL, GraphLog and G-Log and show the applicability of the method for 
implementing (parts of) these languages. The model-checking technique for 
data retrieval could also be applied to non-graplrical query language. Formula 
extraction from SQL-like queries can be done using similar ideas. We have 
effectively tested the approach using the model-checker NuSMV. 

The structure of the chapter is as follows. Sections 3.1 and 3.2 introduce 
basic definitions of model-checking and of the language W. In Section 3.3 we 
show how to assign a KTS to a database instance, while in Section 3.4 it is 
shown how to extract CTL formulae from query graphs. In Section 3.5 we 
report some basic results on the complexity of the data retrieval problem by 
using our approach. A complete example with experimental results is given in 
Section 3.6, while Section 3.7 is devoted to apply the method to the languages 
UnQL [12], GraphLog [33] and G-Log [81]. To conclude, in Section 3.8 we 
sum up the main results of the proposed approach. 



3.1 An Introduction to Model-Checking 

In this section we recall the main concepts and results of model-checking and 
we introduce the model-checking problem for the branching time temporal 
logic CTL [49]. In Section 3.1.2 we report the linear time algorithm for veri- 
fying that a finite-state concurrent system meets a specification expressed in 
CTL introduced in [27]. 



3.1.1 Transition Systems and CTL 

In the last two decades model-checking [27, 73] has emerged as a promising 
and powerful approach to automatic verification of systems. Intuitively, a 
model-checker is a procedure that decides whether a given structure M is 
a model of a logical formula ip. More in detail, M is an (abstract) model 
of a system, typically a finite automata-like structure, and ip is a modal or 
temporal logic formula describing a property of interest. 

Model-checking typically depends on a discrete model of a system, in fact 
the system’s behavior is (abstractly) represented by a graph structure, where 
the nodes represent the system’s states and the arcs represent possible tran- 
sitions between the states. Graphs alone are usually too weak to provide an 
interesting description, so they are annotated with more specific information. 
A very common approach is to use Kripke Structures where the nodes are 
annotated with so-called atomic propositions or Kripke Transition Systems, 
where the nodes are annotated with atomic propositions and the arcs are 
annotated with actions. 

Definition 3.1.1. A Kripke Transition System (KTS) over a set IT of 
atomic propositions is a structure K. = (E,Act,lZ,I), where S is a set of 
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states, Act is a set of actions, 1Z C £ x Act x £ is a total transition relation, 
and I : £ — » p(Z7) is an interpretation. 

According to [73], we assume that 77 D Act = 0. 




Figure 3.1. A Kripke Transition System 



Figure 3.1 shows a Kripke Transition System: the initial state satisfies the 
atomic proposition “A” and has a transition labeled “I” to a state where the 
property “B” holds. From this last state there are two different transitions 
driven by actions “m” and “n”, and so on. 

The implementation of some model-checkers (e.g. NuSMV [26]) works 
with the structurally more simple Kripke Structures, but any Kripke Tran- 
sition System induces in a natural way a Kripke Structure which codes the 
same information. 

Remark 3.1.1. KTS are very expressive structures modeling modal logic. 
Kripke Structures (KS) are KTS in which transitions are not driven by ac- 
tions. Standard techniques can be used to transform KTS in KS (see, for 
example, [73]). The idea is to associate the information about the action 
exchanged in a transition with the reached state instead of the transition 
itself. 

The interpretation I in a Kripke Transition System defines local properties 
of states. Often we are also interested in global properties connected to the 
transitional behavior. For example, we might be interested in reachability 
properties, like “can we reach from the initial state a state where the atomic 
proposition P holds?” . Temporal logics [49] are logical formalisms introduced 
for expressing such properties. 

There are two main kinds of temporal logics, linear-time logics and 
branching-time logics. Linear-time logics are concerned with properties of 
paths. On the other hand, branching-time logics describe properties that de- 
pend on the branching structure of the model. 

Computational Tree Logic ( CTL ) was the first temporal logic for which an 
efficient model-checking procedure was proposed [27]. Its syntax is as follows. 
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Definition 3.1.2. Given the sets 77 of atomic propositions and Act, CTL 
formulae are recursively defined as follows: 

1. each p £ 77 is a CTL formula; 

2. if (p i and (p 2 are CTL formulae, a C Act, then ~<(pi, p i A p 2 , KX a (p\), 
EX a (ipi), AU a (<£i, 1 P 2 ), and EU a (y?i, ^ 2 ) are CTL formulae. 1 

The length of a formula is determined by counting the total number of 
operands and operators. For example the length of —up is 1 + lengthfp). 

A and E are the universal and existential path quantifiers, while X (neXt) 
and U (Until) are the linear-time modalities. Composition of formulae of the 
form tp\ V <^ 2 ) Pi — > Pi, and the modalities F (Finally) and G (Generally) can 
be defined in terms of the CTL formulae: F(cp) = U(true, <p), G(<p) = — >F(— 1 ^ 3 ) 
(cf. [49]). 

A path (called fullpatlr in [49]) in a KTS /C = (£, Act, TZ, I) is an infinite 
sequence 7 r = (no, a 0 , n\,a\, n 2 , 02 , . . .) of states and actions ( 7 q denotes the 
7-th state in the path n) s.t. for all* £ N it holds that 7r i £ £ and 

— either (ni, ai, 7Tj+i) £ TZ, with o* £ Act, 

— or there are not outgoing transitions from ni and for all j > i it holds that 
aj is the special action 0 which is not in Act and nj = 7q. 

Definition 3.1.3. Satisfaction of a CTL formula by a state s of a KTS 
K, = ( £,Act,TZ,I ) is defined recursively as follows: 

— If p € 77, then s \= p iff p £ I (s). Moreover, s |= true and s [A false; 

— s |= f>i A (j) 2 iff s |= and s |= f> 2 ; 

— s |= EX a (4>) iff there is a path n = (s,x,ni, . . .} s.t. x £ a and n\ |= (j>; 

— s |= kX a ((j)) ( a C Act) iff for all paths n = (s,x, n\, . . .} s.t. x £ a, it holds 
that 7Ti |= 4>; 

— s |= EU a ((^i, (j> 2 ) iff there is a path n = (noffo, n\,i\ . . .}, and 3 j £ N s.t. 

7To — S ) 

— nj |= 4 > 2 , and 

— (V* < j) ( 7Ti |= (f 1 and ii £ a); 

— s |= AU a ((/>i, fa) iff for all paths n = (no, £ 0 , n±, t\ . . .} such that no = s, 
3 j £ N s.t. 

— nj |= <j >2 and 

— (Vi < j ) (nj ]= <fi and tj£ a). 

If K. = (£, Act, 1Z, I) is a KTS and ip is a CTL formula, we say that K, \= ip 
if and only if for all s £ I it holds that s \= ip. 



1 When a is of the form {m} for a single action m, we simplify the notation writing 
AX m , EX m , AX m , EX „, . 
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There are two ways in which the model- checking problem can be specified: 

Definition 3.1.4. The local model- checking: given a KTS 1C, a formula tp, 
and a state s of 1C, verifying whether s |= tp. The global model- checking: 
given a KTS 1C, and a formula <p, finding all states s of 1C s.t. s\= tp. 

The above problems are usually formulated for £ finite. In this case, the 
global model-checking problem for a CTL formula tp can be solved in linear 
running time on \tp\ • (|A| + \H\) [27]. Note that the solution of the global 
model-checking problem comprises the solution of the local problem. 



3.1.2 A Linear Time Algorithm to Solve the Mo del- Checking 
Problem 

In this section we report the algorithm introduced in [27] which has time 
complexity linear in both the size of the specification expressed in CTL and 
the size of the Kripke Structure modeling the concurrent system to analyze. 

Assume that we want to determine whether the formula ip is true in the 
KS K. = (£,TZ,I). The algorithm is designed to operate in more steps: the 
first step processes all subformulas of tp of length 1, the second step processes 
all the subformulas of length 2, and so on. At the end of the ith step, each 
state will be labeled with the set of all subformulas of length less than or 
equal to i that are true in the state. The expression label(s) denotes this 
set for the state s £ £. When the algorithm terminates at the end of step 
n = lengthitp) , we obtain that for all states s £ £, s \= f iff f £ label(s), for 
all subformulae f £ tp. 

Note that the length of a formula is determined by counting the total 
number of operands and operators. 

In the description of the algorithm we use the following primitives for 
manipulating formulae and accessing the labels associated with states: 

— argl(tp) and arg2(tp) give the first and second arguments of a two-argument 
temporal operator; thus, if tp = AU(/i,/ 2 ), then argl(tp) = f\ and 
arg2(tp) = f 2 . 

— labeled(s , /) will return true (false) if state s is (is not) labeled with the 
formula /. 

— add-labeled(s, /) adds formula / to the current label of state s. 

We explicitly show the algorithm to solve the model-checking problem for 
the case in which the formula is tp = AU(/i, f^), because the other cases are 
either simpler or similar [27]. In the case we are presenting, the algorithm 
(i.e. the procedure label-graph (tp)) uses a depth-first search to explore the 
state graph. 
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procedure label-graph(</?) 

begin 

{ main operator is AU } 

begin 

ST := empty-stack; 

for all s £ £ do marked(s) := false', 

L: for all s £ £ do 
if -imarked(s) then au(tp, s, b) 

end 

end 

In the procedure label-graph (ip), ST is an auxiliary stack variable and the 
boolean procedure stacked(s) indicates whether state s is currently on the 
stack ST. The recursive procedure au(tp, s, b) performs the search for formula 
tp starting from state s. When au terminates, the boolean result parameter 
b will be set to true iff s |= tp. The code for procedure au is the following: 

procedure au(tp, s , b) 

begin 

Assume that s is marked. If s is already labeled with p, we set b 
to true and return. Otherwise, if s is on the stack, then we have 
found a cycle in the state graph on which argl(tp) holds but tp is 
never fulfilled. Thus we set b to false and return. Otherwise, we have 
already completed a depth-first search from s, and tp is false at s; so 
we must also set b to false and return in this case. Note that there is 
no need to distinguish between the last two cases, since the action is 
the same in each case. 

if marked(s) then 
begin 

if labeled(s,tp) then 

begin b := true', return end; 
b := false; return 
end; 

Mark state s as visited. Let tp = AX(/i,/ 2 ). If / 2 is true at s, tp is 
true at s; so label s with tp and return true. If fi is not true at s, the 
tp is not true at s; so return false. 

marked(s) := true; 
if labeled(s, arg2(tp)) then 

begin add-labeled(<^, /); b:=true; return end 
else if ~^labeled(s, argl(tp)) then 
begin b := false; return end; 
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At this point of the algorithm we know if /i is true at s and that /2 
is not. Check to see if ip is true at all successor states of s. If there 
is some successor state si at which ip is false, the ip is false also at 
s; hence remove s from the stack and return false. If if is true for 
all successor states, then ip is true at s; so remove s from the stack, 
label s with ip, and return true. 

pushes, ST)- for all si £ successors(s) do 

begin 

au(tp, si, 61) 

if ->61 then 

begin pop(ST); b := false-, return end 

end; 

pop(ST)-, add-labeled ( s, ip); b:=true m , return 
end of procedure au. 

In order to handle an arbitrarily CTL formula ip, one has to successively 
apply the state labeling algorithm described above for the case AU(/i,/ 2 ) 
to the subformulae of ip, starting with simplest subformulae and working 
backward to ip: 

for / := lengthfip ) step -1 until 1 do 
label-graplr(/); 

As proved in [27], since each pass through the loop takes time 0(| A|-|-|7vl|), 
where £ is the set of states and 7Z is the transition relation of the KS, we 
conclude that the entire algorithm requires 0(length(ip ) x (|A| + |7\1|)). 

We recall that a model-checker is a procedure that decides whether a given 
structure 1C = (£, 1Z, I ) is a model of a formula ip, i.e. whether K, satisfies <p 
(written 1C \= ip). Moreover, if K. = (£, 1Z, I) is a Kripke Structure, we say 
that 1C ]= ip if and only if Vi £ I : i |= ip. 

Thus, given a Kripke Structure 1C = (£,TZ,I) and a CTL formula ip, we 
can say that a model-checker is an algorithm working in the following way: 

1. we apply the procedure label-graph (ip) on /C; 

2. if Vz £ I it holds labeled(i,tp), then the answer of the model-checker is 
“/C satisfies ip" , otherwise we conclude that “/C does not satisfy ip" . 

3.2 Syntax of the Query Language W 

In this section we describe the syntax of the language W, a very simple graph- 
based language that we will use to characterize queries that have an intuitive 
temporal-logic interpretation. This language is very similar to G-Log [81] (see 
also Chapter 2): on one hand, it is more simple because in our approach 
we do not consider the possibility to generate new relations by using rules, 
on the other hand, we include in the language the possibility to express the 
universal quantification in a very compact way (cf. Sect. 3.7.3). 
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Definition 3.2.1. A W-graph is a directed labeled graph (N,E,£), where N 
is a (finite) set of nodes, ECNx(CxC)xN is a set of labeled edges of 
the form (m, label, n), £ is a function £ : N — > C x (£U {_L}) . _L means 
‘undefined’ (or dummy ), and 

— C = { solid , dashed } denotes how the lines of nodes and edges are drawn. 

— C is a set of labels. 

£ can be seen as the composition of the two single-valued functions £c and 
£c- With abuse of notation, when the context is clear, we will use £ also for 
edges: if e = (m, (c, k),n), then £c(e) = c and £p(e) = k. Two nodes may be 
connected by more than one edge, provided that edge labels be different. 

Definition 3.2.2. If G = ( N,E ,£ ) is a W-graph, then 

— G s = (N s , E s , £\n s ) is the solid subgraph of G, i.e.: 

N s = {n € N : £c{n) = solid} 

E s = {(m, (solid, £),n) £ E : m,n £ N s j 

— the size of G is |G| = \N\ + \E\ 

— given two sets S,T C N , T is accessible from S if for each n £ T there is 
a node m £ S such that there is a path in G from m to n. 

Definition 3.2.3. A W-instance is a W-graph G such that f'c(e) = solid for 
each edge e of G and £c(n) = solid and £c(n) ^ -L for each node n of G. 

Definition 3.2.4. A W-query is a pointed W-graph, namely a pair ( G , v) 
with v a node of G. A W-query (G,u) is accessible if the set N of nodes of 
G is accessible from {v} (the point). 

See Fig. 3.2 for some examples of W-graplrs. Dashed nodes and lines are 
introduced to allow a form of negation. The meaning of the first query is: 
collect all the teachers aged 37. The second query asks for all the teachers 
that have declared some age (note the use of a dummy node, i.e. a node 
labeled _L). The third query, instead, requires to collect all the teachers that 
teach some course, but not Databases. 




Figure 3.2. Three W-queries and a W-instance 



Note that the W-queries in Fig. 3.2 correspond to Very Simple Queries of 
G-Log (cf. Sect. 2.3.4). 
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3.3 W-Instances as KTS 

In this section we show how to build the KTS associated with a W-instance. 

Definition 3.3.1. Let G = ( N,E,£ ) be a W-instance; we define the KTS 
Kg = {Eg, ActGjKG^Ia) over IIq as follows: 

— The set of atomic propositions IIg is the set of all the node labels of G: 

n G = {p : (3 n G N)(p = ic{n)} 

— The set of states is Eq = N. 

— The set of actions Acta includes all the edge labels. In order to capture the 
notion of before we also add in Acte actions for the inverse relations and 
for the negation of all the relations introduced. Thus , if m n belongs to 
E we add the actions p,p -1 ,p, p -1 . We define two sets: 

AcIq = {p,p -1 : (3 to G iV)(3n G N){{m,p,n) G E)} 

Act a = {q, q ■ q G Act^.} 

Moreover, we can assume, for each state s with no outgoing edges in E (a 
leaf) to add a self-loop edge labeled by the special action Q that is not in 

Act-c- 

— The ternary transition relation IZg is defined as follows: let Eq = E U 
{(n,p -1 ,m) : (m,p,n) G E}. Then 

7 Zq = Eq U {( m,p , n) : to, n G N,p G Act q, ( m,p , n ) ^ E}. 

— The interpretation function Tq can be defined as follows. In each state n 
the only formulae that hold are the unary atom lc(p) and true: 2 

loin) = {true, £c(n)} 

Observe that: \R,g\ = \ AcIq\- |IV| 2 < \E\ ■ |IV| 2 since for each pair of nodes, 
exactly one between q or q holds, for q = p or q = p -1 , q G IIg. 

Example 3.3.1. Consider the graph G = ({m, . . . , ns}, E), of Fig. 3.2. Then: 

— IIg = { Teacher, Course, Student, 37, 40, Databases, Smith} 

— E g = {m, . . . ,n 8 } 

— Acte = { teaches, teaches -1 , teaches, teaches -1 , attends, . . . 

age, . . . , cName, . . . , name, . . .} 

AcIq = {teaches, . . . , name, teaches -1 , . . . , name -1 }. 



2 The two unary atoms represent the basic local properties of a system state. Other 
properties can be added, if needed. 
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— Eq is the union of E with the set 

{ (n 3, teaches -1 , ni), (ri3, teaches - ,712), (ri3, attends -1 , 714), 
(n 5 ,age -1 ,ni), (n 6 , age -1 , n 2 ), (n 7 , cName -1 , n 3 ), 

(ns, name -1 , 714) } 

Then TZq is the union of Eg with the “complement” set and the set of 
self-loop edges labeled by 0: 

{(m,p, n ) : m, n £ Eg,P € AcPq A ( m,p , n) ^ Eg}G 
{(m, 0, m) : m € {?z 5 , n 6 , n 7 , n 8 }} 

— The interpretation X is: 

I(n\) = {true, Teacher} 

T(n2) = {true, Teacher} 

0(777) = {true, Databases} 

Z(n 8 ) = {true, Smith} 



3.4 CTL-Based Semantics of W-Queries 

In this section we show how to extract CTL formulae from W-queries. We 
associate a formula <20 (G) to a query (a W-pointed graph) (G,v). Such a 
formula will give us the possibility to define a model-checking based method- 
ology for querying a W-instance. We anticipate this definition in order to 
make the meaning of formula encoding more clear. 

Definition 3.4.1 (Querying). Given aW-instance I and aW-query (G,v), 
let ICi be the KTS associated with I and $V(G) the CTL formula associated 
with (G,u). Querying I with G amounts to solve the global model- checking 
problem for /Cj and i0(G), namely find all the states s of /C/ such that 

s\=MG). 

Remark S.f.l. In order to correctly solve the global model-checking problem 
for a KTS tCi and a formula \P V (G), it is important to explicitly individuate 
the initial states of /C/. In fact, the set of states of /Cj must be accessible from 
the set of initial states (cf. Def. 3.2.2) because model-checking techniques are 
based on algorithms of graph visit. 

3.4.1 Technique Overview 

In order to explain the technique, we start our analysis by considering simple 
queries in which the pointed graph consists of two nodes (Fig. 3.3). Query R\ 
has no dashed part: only positive information is required. Its meaning is to 
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(Hi) Collect all the teachers 
of some course 



(f?2) Collect all the teachers 
that do not teach all courses 
(i.e., s.t. there is a course that 
they do not teach) 



i 



teaches 



Course 



i 



teaches 



Course 



(R3) Collect all the teachers 
that teach no courses (i.e., for 
each course they do not teach 
it) 



(H4) Collect all the teachers 
that teach all courses (i.e., for 
each course they do not not 
teach — they teach — it). 



Figure 3.3. Simple queries 



look for nodes labeled Teacher that are connected with nodes labeled Course 
by an edge labeled teaches. The CTL formula must express the statement “In 
this state Teacher formula is true and there is one next state reachable by an 
edge labeled teaches, where the Course formula is true”, i.e. 

Teacher A EX teac h es (Course) 

The CTL operator neXt (used either as EX or AX) captures the notion of 
following an edge in the graph. Thanks to this operator, we can easily define 
a directed path on the graph, nesting formulae that must be satisfied by one 
(or every) next state. 

Query R2 contains a dashed edge teaches that introduces a negative in- 
formation. i?2 requires the existence of two nodes, and the non-existence of 
one edge labeled teaches between them. We can express this statement by 

Teacher A EX teaches (Course) 

The availability of the negation of the predicate symbol teaches, allows us to 
say that the relation teaches does not hold between two nodes is the same as 
requiring that, between the same pair of nodes, the relation teaches holds . 3 

The meaning of query R3, where there are dashed edges and nodes, is 
rather different. This formula is true if “there is a node labeled Teacher s.t., 
for all the nodes labeled Course, the relation teaches is not fulfilled”. A CTL 
formula that states this property is the following: 

Teacher A AX teac h e s (^Course) 

To give a semantics to query i?4, first replace the solid edge labeled teaches 
with the dashed edge labeled teaches. Then use the same interpretation as 
for query R3: 

3 Here, an implicit closed word assumption is made. 
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Teacher A AX^i^-(^Course) 

Its intuitive meaning is: “it is true if Teacher is linked by edges labeled teaches 
to all the Course nodes of the graph. Note the extremely compact way for 
expressing universal quantification. 



3.4.2 Admitted Queries 

We will show how to encode W-queries in CTL formulae. The equivalence 
result with G-log (see Sect. 3.7.3) and the NP-completeness of the subgraph 
bisimulation problem ([44]) prevents us to encode all possible queries in a 
framework that can be solved in polynomial time. We will encode four families 
of queries Q = (G, p): 

— Q is an acyclic accessible query (Sect. 3.4.4). 

— G is an acyclic solid graph (Sect. 3.4.4). 

— G is an acyclic graph and after the application of a rewriting procedure, it 
becomes acyclic and accessible from {p} (Sect. 3.4.4). 

— G is in one of the four families above but some leaf nodes are replaced by 
simple solid cycles (Sect. 3.4.5). 



3.4.3 Query Translation 



As initial step, we associate a formula tp to each node and edge of a graph 
G. Then we will use this notion for the definition of the formula. 



Definition 3.4.2. Let G = ( N,E ,£ ) be a W -graph. 
— For all n £ N we define the formula p(n) as: 



<P(n) 



t c{n ) iflc(n)^± 
true otherwise 



— For all edges e = (ni, {c,p),n 2 ) £ E, the formula p(e) is defined as: 



<p(e) = 



P 

P 



if £c(e) = tc{n 2 ) 
otherwise 



3.4.4 Acyclic Graphs 

Let us assume that G is an acyclic graph. 

Definition 3.4.3. Let G = ( N,E,£ ) be an acyclic W -graph, and v £ N. 
Then the formula F„(G), depending on both G and v is defined recursively 
as follows (cf. Fig. 3.4^: 

— let bi, . . . ,bh (h> 0) be the successors of v s.t. £c(bi) = solid, 

— let ci, . . . ,Cfc (k > 0) those s.t. ic{ci) = dashed. 




3.4 CTL-Based Semantics of W-Queries 



63 




Figure 3.4. Graph for computing \t/ v (G). 



— for i = 1, . . . , h and j = 1, ... ,k let e* be the edge which links v to bi and 
e' the one which links v to Cj. 

1. If Iciy) = solid, then 

MG) = viy) A A EX v(ei) (^(G))A A AX v(e ,)0 t Cj [G)) 

i=l...h j=l...k 

2. else (IcW) — dashed) 

^(G) = ^)V V A X v(ei) (^(G))V \/ E X vWj) {* Cj {G)) 

j—l...k 

Given a graph G, let G be the graph obtained from G by complementation 
of the colors of edges and nodes (solid becomes dashed and vice versa). It is 
immediate to prove, by induction on the depth of the subgraph of G that can 
be reached from a node v, that 'T V {G) = ~^ V {G). Moreover, it is immediate, 
by the recursive definition of the formula, and by the acyclicity of G, that 
the formula built is a CTL formula (cf. Def. 3.1.2). 

Definition 3.4.4. Let G = (N,E,£) be an acyclic W -graph, v € N, and 
Q = (G,v) be an accessible query. The formula associated to Q is I / V {G). 

Remark 3. f.2. Observe that: 

1. Each node and edge of G is used to build the formula. 

2. The size of the formula T' l ,(G) can grow exponentially w.r.t. |G|, since the 
construction of the formula involves the unfolding of a DAG. However, 
it is only a representation problem: subformulae can be repeated many 
times. It is an easy task to compute the formula avoiding the repetitions 
and keeping the memory allocation (and the execution time) linear w.r.t. 

\G\. 

The condition on the accessibility of all nodes of G for v can be weakened, 
at least for some cases. Consider, for instance, the two goals of Fig. 3.5. 
If works is the inverse relation of gives salary (i.e. gives salary -1 ), then one 
expects the same results from query (a) and query (6). Thus, the idea is 
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(a) (fc) 

Figure 3.5. Two equivalent acyclic queries 



to swap the direction of some edges, replacing the labeling relation with its 
inverse. 4 This procedure can be automatized: 

Algorithm 3.4.1. Let G = ( N , E, £) be an acyclic solid W -graph and v G N. 

1. Let G = ( N,E ) be the non-directed graph obtained from G defining: 

E = {{to, n} : (to, £, n) G E} 

2. Identify each connected component of G by one of its nodes. Use v for 
its connected component. Let pi, • • • > Ph be the other nodes chosen. 

3. Execute a breadth-first visit of G starting from v, then from nodes 
Pi i ■ • • j Ph ■ 

4- Consider the list of nodes v = no, n \, . . . , nk ordered by the above visit. 

5. Build the W -graph G = (N,E,£) from G by swapping the edges that are 
not consistent with the above ordering: 

E = (E \ {(n a , (c,p),n b ) G E : b < a})U 

{(n b , (c,p _1 ),n a ) : (n a , (c,p),n b ) G E A b < a} 

The above algorithm always produces an acyclic graph, since the edges 
follow a strict order of the nodes. All the nodes of each connected component 
of G are accessible from the corresponding selected node (u, pi , . . . , ph) by 
construction (they have been reached by a visit). The Algorithm 3.4.1 can 
be implemented in time 0(|iV| + |-E|). 

Thus, for each node u, pi, . . . , ph we can compute the formulae 

^(G),^ 1 (G),...,^(G) 

We recall here the semantics of the formula EF a (<^) we will use in the following 
definitions (see also Sect. 3.1): 

Given a state s, s \= EF Q (<^) iff there is a path (no, £q, 7Ti, l\ • • •), s.t. no = s 
and 3j G N s.t. nj |= <j) and (Vi < j ) ti G a. 



4 Recall that in the KTS associated to a W-instance, inverse relations for all the 
relations involved occur explicitly: the framework is already tuned to deal also 
with this case. 
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Definition 3.4.5. Let G = (N, E, I) be an acyclic solid W -graph and v € N. 
Let Q = (G, v) a W -query. The formula associated with Q is: 



MG) A EF Acta (M(G)) A • • • A EF Act jM(G)) 
where Acta is as in Def. 3.3.1. 

Observe that the Algorithm 3.4.1 terminates even if G admits cycles (save 
self-loops). However, in these cases, the semantics of the query is lost. Cyclic 
queries require different modal operators (see Subsection 3.4.5). 

Let us study another family of acyclic queries that can be handled, via 
reduction to the accessible query case. 

Definition 3.4.6. Let G = ( N,E ,£ ) be an acyclic W -graph, let v £ N , 
i c {v) = solid, and Q = (G, v) be a W -query. 

1. Let G s = (N s , E s , £\n b ) its solid subgraph (cf. Def. 3.2.2/ 

2. Apply the Algorithm 3.4.1 to G s . Let u, p\, . . . , ph be the root nodes cho- 
sen. 

3. Swap in G the same edges that have been swapped by the Algorithm 3.4.1 
in the step 2, obtaining the graph G. 

f. If G is acyclic, then compute the formulae 

MG),M(G),...,M(G) 

5. If all the nodes of G have been visited during the last phase, then the 
formula associated with Q is: 

MG) A EF ActG (M(G)) A • • • A EF Act jM(G)) 

Let us explain why we applicate the algorithm only to G s . Consider, 
for example, the query (a) in Fig. 3.6. According to Def. 3.4.6 the formula 
associated with it is: 

Teacher A AX^teaches(^Course) A EF^ C ^ (Student A AX^ atte nds( _, Course)) 

which requires to find all those Teachers who teach all the Courses, if there is 
somewhere a Student who attends all the Courses. 

The Algorithm 3.4.1 should replace the edge labeled attends in the query 
(a) with an edge labeled attends' 1 , as depicted in the query (6) of Fig. 3.6. 
The formula associated with this query is: 

Teacher A AX^ teac h es (^Course V AX attends -i Student) 

The two formulae have different models. The first is closer to the interpreta- 
tion of graph-based formulae in other frameworks (e.g. G-Log). 
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3.4.5 Cyclic Queries 

In this section we extend the technique assigning a temporal formula to 
queries admitting cycles. We make use of the Generally operator, used as 
EG a (</>), whose semantics is (see also Section 3.1): 

Given a state s, s |= EG a {(f) iff there is a path i r = (ttq, £o, 7Ti, . . .), 
s.t. 7 Tq — s and Vj G N (nj |= (f and £j G a). 



Teacher 


— 


Teacher 




\ teaches 


\ teaches 


Course 




Course 




^ attends 


^ attends 


Student 




Student 





(a) 




(G) 



m 




(G) 



Figure 3.6. W-queries and W-instances 





calls 




Class 


defined 


Function 



(c) 



Consider the query (c) in Fig. 3.6. It requires to collect all the classes which 
call a function defined inside themselves. This property can be expressed by 
using the much richer temporal logic CTL*, which extends CTL by allowing 
basic temporal operators where the path quantifier (A or E) is followed by 
an arbitrary linear-time formula, allowing boolean combinations and nesting, 
over F, G, X, and U [49]. Precisely, given a set „4cf of actions and a set 
II of atomic propositions, the set Lit of literals is defined as Lit = II U 
{-9 \ Q G n} U {true, false}. State formulae (f and Path formulae if are 
inductively defined by the following grammar, where p G Lit and a C _4cf: 

state formulae: ::= p \ (j) A (f \ $ V 4> \ Aif> \ E ip 

path formulae: if ::= (f \ if A if \ if \/ ip \ G a if \ F a if \ X a if \ 

U a(if,if) 

The CTL* formula for the query (c) in Fig. 3.6 is: 

!F( c ) = Class A EX ca || s (Function A EXdefined (Class)) A 

EG{ ca || S: defined} ((Class V Function) A Class — > X ca || S (Function) A 

Function — > Xdefined (Class)) 
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Observe that the modal operator X is used without any path quan- 
tifier. This is allowed in CTL* but not in CTL. In particular, a path 
7 r = (no, £o, tti, £\ . . .) satisfies the formula X a f/> when £o £ a and holds 
starting at the second state 7 Ti on the path. 

The first part of the formula T( c \ above is aimed at identifying a path of 
length greater or equal that two, the Generally operator imposes to retrieve 
only cyclic paths where nodes labeled Class alternate with nodes labeled 
Function. 

With the CTL logic it is only possible to approximate the translation of 
this kind of cyclic queries. In fact, a CTL formula for the query (c) in Fig. 3.6 
could be: 

= Class A EX ca || s (Function A EXdefined (Class)) A 

EGicaiis defined} (Class r EX ca n s (Function A EXdefined (Class) ) ) 

The node Class of instance ( I \ ) in Fig. 3.6 satisfies both the formulae 'T( c - j 
and i P{ c y Instead, the node rii of instance ( I 2 ) in Fig. 3.6 satisfies but 
not 'L'(c)- Thus, the CTL translation gives only an approximation of the CTL * 
one. 

Since the model-checking problem for CTL * is PSPACE complete, we 
accept this loosing of precision, and we assign a CTL formula also to a generic 
(pointed) cycle. 

Definition 3.4.7. The formula T Vl (G) associated to a red cyclic graph 
G= ({vx, . . . ,v n } , {(vi, ^, 1 * 2 ) , (v 2 A 2 ,vf) , . . . , (v n , tn,Vi)} , £,Vi) , 
is defined as: 

T V1 (G) — tpc A -> ip c ) 

where </?(iq) is as in Def. 3.4.2, and ifc is the CTL formula Vl (C) associated 
to the DAG 

C = ({iq, . . . , v n +i], {Wuh, v 2 ), (V 2 J 2 , v 3 ), (n n ,£ n , v n+ 1 )}), 
rooted at iq and where v n +i is a “copy v of zq, i.e. a new node s.t. £(u n+ 1 ) = 

^>i). 

The formula ’ifc is used twice for the sake of simplicity. In its second usage 
it could be shortened, e.g. by removing the formula of the first node. 

Note that a special case of generic cyclic queries are simple cycle queries 
(e.g. queries shaped as rings) composed by nodes with a unique label £ n and 
edges all labeled £ e . This kind of cyclic W-queries can be exactly translated 
into CTL. 

Consider for example, the cyclic query (a) in Fig. 3.7, it requires to collect 
all the functions which are recursive. Its behavior could be described by: 
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(«) (<>) (c) 



Figure 3.7. Three bisimilar simple cyclic queries 



Function A EX ca || s (Function A EX ca || s (Function A EX ca || s (- • •))) 

that can be mimicked exactly by 

EG ca |[ s Function (3.1) 

It is easy to see that this formula (more compact than the formula in 
Def. 3.4.7) expresses the cyclic condition depicted in the W-query (a) in 
Fig. 3.7. In fact, if we apply the procedure in Def. 3.4.7 to the W-query (a) 
we obtain the formula 

Function A EX ca n s Function A EG ca || s (Function — * EX ca || s Function) 

which has the same models of EG ca iisFunction. 

It is easy also to see that queries (a) and (6) (and also (c)) can be consid- 
ered equivalent, their models satisfy the formula EG C aii s Function. This fact has 
a graph counterpart: their pointed graphs are equivalent modulo bisimula- 
tion (for this topic, see Section 3.7.3). (a) is the minimum graph bisimulation 
equivalent (from [5] we know that there is always a unique minimum graph, 
called bisimulation contraction). 

To sum up, consider the W-query (a) in Fig. 3.8. In order to find its CTL 
encoding we have to apply the rewriting procedure to swap the in edge, thus 
obtaining the W-query (6) and the formula: 



ip„ = Class A EX|jj(Class A EX name Math) A EX in -i (Procedure A 
EX ca || s (Fuction A EG ca || s Function)) 

Note that the CTL formula associated to the simple cyclic query rooted 
at n is EG ca ii s Function. 



3.5 Complexity Issues 

Let us state the main computational result of our approach: 

Theorem 3.5.1. Let (G,v) be a W-query in one of the forms described in 
Section 3.4.2. Querying a W -instance I with G is done in linear running time 
on |/Cj| and \'L' U {G)\. 
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(«) (&) 

Figure 3.8. Two W-queries 



The proof of the theorem follows from [27] . 

When computing /C/, a quadratic time and space complexity is introduced 
as negative relations are computed and stored. 

As far as the size of the formula is concerned, as discussed in Remark 3.4.2, 
even if \T V {G)\ can grow exponentially with |G|, it is natural to represent it 
using a linear amount of memory. This compact representation is allowed by 
the model-checker NuSMV. 

As shown in Sect. 3.4.5 for cyclic queries, it seems to be impossible to map 
all the queries in CTL formulae. This fact can be justified algorithmically. 
The semantics of acyclic W-queries without negation can be proved to be 
equivalent to that of G-Log queries without negation (Theorem 3.7.2). If we 
extended this equivalence result to cyclic queries (even without negation) , we 
would provide an implementation of a subset of G-log in which data retrieval 
is equivalent to the subgraph bisimulation problem, proved to be NP complete 
in [44]. This would be in contradiction with Theorem 3.5.1. 



3.6 Implementation of the Method 

We have effectively tested the data retrieval technique presented in the pre- 
vious sections by using the model checker NuSMV [26], which is designed to 
check the satisfaction of specifications expressed in CTL (or LTL) on Kripke 
Structures. To this aim, KTS related to W-instances have been rewritten into 
Kripke Structures, where edges are not labeled (see Remark 3.1.1). The idea 
of the rewriting rule is expressed by the following: 

Algorithm 3.6.1. Given a KTS, replace every labeled edge 

label 

m > n 



by the two edges 



m 



ILL 



n, 



where p is a new node labeled label. 
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Inverse edges are not added by this encoding since the modal operator before 
is not considered in this work and, hence, not generated by the formula 
translation. 

We have tested the three queries of Fig. 3.11 on the W-graplr of Fig. 3.10. 
We have coded the W-instance into a KS. The space of states of the KS is 
determined by the declaration of the state variables (in the above example 
s). s is a scalar variable, which can take the symbolic value nl, . . . , nil (the 
node identifiers in the instance) or el, . . . , el2 (the identifiers of the nodes 
added in order to replace the labeled edges). The transition relation of the 
KS is expressed by assigning, for each value of the variable s, the list of nodes 
that can be reached from it using one edge. The variable i is introduced to 
define the label of each state, identified by the value of the variable s. 



MODULE main 
VAR 

s : {nl, ..., nil, el, ..., el2}; 
lab : {person, company, city, 

works, owns, address, lives}; 

ASSIGN 

init(s) := {nl, n5, n6, nil}; 
next(s) := case 

s = nl : el; s = n2 : {e2, e3}; 
s = n3 : {e3, el2}; s = n4 : n4; s = n5 : {e4, e5}; 
s = n6 : {e6, e7}; s = n7 : e8; s = n8 : n8; 
s = n9 : n9; s = nlO : {elO, ell}; s = nil : e9; 

s = el : n2; s = e2 : n3; s = e3 : n4; s = e4 : n3; 

s = e5 : n4; s = e6 : n4; s = e7 : n7; 

s = e8 : n8; s = e9 : n8; s = elO : n7; 

s = ell : n9; s = el2 : n4; 
esac; 

DEFINE 

£ : = case 

s = nl : person; s = n2 : company; 
s = n3 : company; s = n4 : city; 
s = n5 : person; s = n6 : person; 
s = n7 : company; s = n8 : city; s = n9 : city; 
s = nlO : company; s = nil : person; 
s = el : works; s = e2 : owns; s = e3 : address; 

s = e4 : works; s = e5 : lives; s = e6 : lives; 

s = e7 : works; s = e8 : address; s = e9 : lives; 

s = elO : owns; s = ell : address; 

s = el2 : address; 
esac; 



Figure 3.9. NuSMV format of a W-instance 
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The translation of the queries in the NuSMV format is: 



SPEC 

£ — person & E X(£ = works & EX(f = company)) 
SPEC 

£ = person & E X(£ = lives & EX(£ = city)) 

& kX(£ = works -> ! EX(£ = company)) 

SPEC 

£ = company & EX(£ = owns & EX(f = company 
& EX(f! = address & EX(f! = city)))) 

& EX(t? = address & EX(f = city)) 



(Qi) requires all the Persons who work in a Company and (Q 2 ) all the 
Persons who live in a City and do not work in any Company. Query Q 3 is 
not a tree: it requires to collect all the nodes Company which own another 
Company. Both companies must have address in the (same) City (actually, the 
bisimulation based semantics can only force that they have address in some 
city. For requiring the same city, further information, e.g. the name of the 
city, is needed). 

Note that in the CTL translation of query (Q 2 ) we do not use negated 
edges, in fact the required property can be formalized by using a universal 
quantification (e.g. if a Person works somewhere we require that the place is 
not a Company). However, also in the NuSMV implementation negated edges 
are required and have to be stored to correctly encode queries with negated 
edges between positive nodes. Consider, for example, query (R 2 ) of Fig. 3.3, in 
order to find all the pairs of nodes Teacher and Course that are not connected 
by an edge teaches, we have to add between each pair of nodes without a 
teaches relation, a service edge representing the fact that that relation does 
not hold. Note that we choose to label the new relation ^teaches. In fact 
model-checkers use only paths to traverse a Transition System and thus to 
find pairs (or tuple) of nodes connected with a certain arc (i.e. a transition). 




Figure 3.10. Sample Concrete Graph 
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Figure 3.11. Sample Queries 

NuSMV is a procedure that decides whether a given KS M is a model of a 
CTL formula If', i.e. whether M satisfies SP, and returns as main output only 
true or false. In our application we are interested in finding out the set of 
states in M that satisfy B. To this purpose we have extended the NuSMV 
code in order to solve the global model-checking problem, i.e. to print all the 
states of a given KS that satisfy B. 

Intuitively, given a Kripke Structure /C = (V, 1Z, I) and a CTL formula ip, 
we can say that the extension of the model-checker NuSMV is an algorithm 
working in the following way: 

1. we apply the procedure label-graph(ip ) on /C (see Sect. 3.1.2); 

2 . for each s € £ such that labeled(s,p), the output of the model model- 
checker is “The state s satisfies the property tp” . If Vs £ £ it holds 
-> labeled(s,p ), we conclude that “There are no states that satisfy 99 ”. 



For the examples in Figure 3.11, we have obtained the answers: 




Moreover, we have effectively tested on real-life size examples that our 
extension of the NuSMV system runs in linear time in both the sizes of the 
KS and B. 
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3.7 Applications to Existing Languages 

In this section we apply the model-checking based data retrieval approach 
to some existing query languages for semistructured data, such as UnQL, 
GraphLog, and G-Log. 



3.7.1 UnQL 

We will show that UnQL databases can be immediately mapped to W- 
instances (see Section 2.9.1 for a brief description of the language). Then 
we will show how to encode UnQL queries into modal formulae. 

Algorithm 3.7.1. Let G = ((N, E, £}, v) be an edge-labeled (UnQL) graph 
rooted at v with name DB. Assume, w.l.o.g., that DB ^ £(E). We define the 
W -graph G' = (N',E',£') rooted at v as: 

1. E' = E is the set of edges; 

2. N' = N is the set of nodes; 

3. (!(y) = DB; 

f. £'(m) = _L if m € N \ {v}. 

G' is rooted since G is rooted. Thus, we can use the root node (labeled 
DB) to identify the instance of a database. 

Remark 3.7.1. UnQL databases can also be translated into Kripke Structures 
(see Remark 3.1.1). In fact, the label of each edge ( n,p , m) can be associated 
to the arrival node m. 



DB 




DB (a) 



Figure 3.12. An UnQL graph, called DB, and its translation 



For example, the UnQL graph DB in Fig. 3.12 is rewritten as the W- 
graplr (a) by labeling the set of nodes. The graph (a) can be seen as a Kripke 
Transition System K. = (£, Act, 1Z, /), where E is the set of bullet nodes, 
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Act = {i?l, -Rl 1 , Rl, . . .}, the transition relation 1Z is the set of edges, and 
the interpretation function / : S — » p({_L, number, true, false}) is defined as 
follows: 

t( \ _ [ {■£(«), true} if s is not a leaf 

' ' 1 {£{s), number, true} otherwise 

The atom number is used to model the fact that in some queries we need 
to know if a state of the Kripke structure contains a numeric value (i.e. its 
label is a number). 

Let us consider the examples of UnQL query introduced in Section 2.9.1 
on the DB database in Fig. 3.12 and their translations in temporal formulae. 
We will use the modal operator B (Before) (not included in the logic CTL), 
whose meaning is the following: 

Given a state s and a temporal formula (f>, s \= AB a (j> iff for all paths 
7r = (. . . ,x,£,s, 7Ti ,...), it holds that x \= ip and £ £ a, whereas 
s |= EB a f/> iff there is a path n = (. . . ,x,£, s,_, 7Ti, . . .) such that 
x\ = ip and £ £ a. 

Note that B does not belong to CTL (cf. Definition 3.1.2), but inverse 
edges we introduce in Kripke Transition Systems allow to replace this oper- 
ator with the CTL operator NeXt (X). 

The expression 



select t where Rl => \t <— DB (3-2) 

computes the union of all trees t such that DB contains an edge Rl => t 
emanating from the root. The same concept can be expressed by the temporal 
formula 

T(i) = T A EBri(DB) 

which is true in each dummy (depicted as bullet) state which has as pre- 
decessor the root node DB and the edge between them is labeled Rl, or 
equivalently, by the CTL formula 



i’i l) --LA EX m -i {DB). 



The UnQL expression 

select t where \l => \t <— DB (3-3) 

retrieves pointed nodes of any edge emanating from the root and is translated 
by the formula 

V(2) = J- A EB Act(DB) 

which is true in all the bullet states accessible by means of a unique edge 
from the root named DB. The same UnQL query can be translated in CTL 
by means of the formula 
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'0(2) = -L A EX Act {DB) 

An UnQL queries that look arbitrarily deep into the database to find 
pointed nodes of all edges with a numeric label is the following: 

select {/} where _* =>• \l _ <— DB, isnumber(l ) (3-4) 

The expression (3) can be easily translated into the CTL formula 

0(3) = 1 A number 
which is true in the leaf nodes. 

UnQL can look arbitrarily deep into a database; this feature has an im- 
mediate translation into CTL thanks to the F operator. For example, given 
the W-instances of Fig. 3.2, the CTL query 

Teacher A EF^ c<G Databases 

requires to find Teacher nodes having a path that eventually reaches a 
Databases node. It would be interesting to include this feature also into the 
query language W; one possibility is to use dotted edges to represent paths 
of an arbitrary length. 

The UnQL query 



select {Tup => {A => x, D => z}} 

where Rl => Tup => {A => \x, C => \y} <— DB, (3.5) 

R2 => Tup => {C =>\y,D => \z} — DB 

computes the join of Rl and R2 on their common attribute C and the pro- 
jection onto A and D. It is easy to check that its application to the database 
DB has an empty result. 

UnQL queries expressing a join condition on a graph do not have a cor- 
responding temporal formula, because in propositional temporal logics it is 
possible to identify a state only by using the set of atomic predicates (i.e. local 
properties) that are true in the state. Thus, nodes with the same properties 
cannot be distinguished. 

An intuitive translation for the former query could be the following: 

¥>(4) = -LA EBxct(-L A EE>a(-L A EX C (T A EXac((-L))))V 
-L A EB^cti-L A EBd(T A EXc(-L A EXa c *(-L)))) 

The formula p holds in two states of the W-graplr (b) in Fig. 3.12 (the bullet 
nodes with unique ingoing edge labeled 1 and 6 respectively), and thus, the 
result of its application on the database DB is not correct. 

Although we have not developed an algorithm to translate UnQL queries 
into CTL formulae, we can conclude that our approach allows to correctly 
deal with join-free queries of UnQL. 
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3.7.2 GraphLog 

GraphLog is a query language based on a graph representation of both data 
and queries [33] . Queries represent graph patterns corresponding to paths in 
databases (see also Section 2.9.2). 

For example, the graph I in Fig. 3.13 is a representation of a flights 
schedule database. Each flight number is related to the cities it connects (by 
the predicates from and to, respectively) and to the departure and arrival 
time (by the predicates departure and arrival). 

GraphLog represents databases by means of directed labeled multigraphs 
which precisely correspond to W-instances. GraphLog databases are not re- 
quired to be rooted and therefore the corresponding W-graphs are not nec- 
essarily rooted. 




(?) (G) 

Figure 3.13. GraphLog representation of a database and a query 



In order to apply model-checking techniques also to this language, we show 
how to encode GraphLog queries into CTL formulae. For example, graph (G) 
in Fig. 3.13 requires to find in a database two flights FI and F 2 departing 
from and arriving to the same city C, respectively. The edge ‘result’ will 
connect FI and F 2. The CTL formula associated with the non-costructive 
part of this query (i.e. the query without the ‘result’ edge) is: 

F{G) = Flight A EX to (City A EX from _i Flight) 

(which intuitively describes the node FI). We remark here on the fact that 
by inverting the from edge we avoid to represent the join condition that is 
present in the query. F(G) is clearly satisfied by the state labeled 109. Note 
that the Kripke Transition System associated to the instance (I) in Fig. 3.13 
could be as follows: 5 



5 In the interpretation function, for each state we add some local properties (i.e. 
the set of atomic properties true in each state is not only composed by the label 
of the state). 
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— LIq = { Flight, City, ArrivalTime, DepartureTime, 109, 815, . . .} 



£g = {m, • • • ,ng} 




The interpretation T is: 




X(ni) = 


{true, City, Miami} 


I(n 2 ) = 


{true, Flight, 815} 


I(n 8 ) = 


{true, City, Seattle} 


I(n 9 ) = 


{true, ArrivalTime, 19 : 29} 



The possibility to express closure literals causes some difficulties in the 
translation of query graphs into CTL formulae. Consider for example, the two 
query graphs in Fig. 3.14 representing the relationships ‘parent’ and ‘relative’ 
between people. Their natural translation into CTL should be: 

<T(Qi) = Person A ((EXp arent Person) 

(Person A EX P a re nt( Person A EX re | ative — i Person))) 

'L'iQ 2) = Person A ((EX re | at ive (Person A EX re | a tivePerson)) — ► 

(Person A EX re | at i ve ( Person A EX re i a tive ( Person A 
EX re [ a tjve — 1 Person ) ) ) ) 

It is easy to see that the node n of the instance graph 1 in Fig. 3.14 
satisfies the formula ^(Q 1 ) but does not fulfill the natural interpretation of 
relationships between people. Once again, the problem is that in CTL it is not 
possible to constrain the last unary predicate Person in ^(Q 1 ) to be satisfied 
by the same state of the first unary predicate Person. 

In order to overcome this limitation and to be able to apply model- 
checking techniques to GraphLog, one should first compute the closure of la- 
beled graphs representing databases. Intuitively, computing graph-theoretical 
closure entails inserting a new edge labeled a between two nodes, if they are 
connected via a path of any length composed of edges all labeled a in the orig- 
inal graph (see [42] for an application of graph closure to XML documents). 
Computing the closure is well-known to be a polynomial time problem w.r.t. 
the size of nodes of the graph. 



Person 



parent 



relative 



t 



Person 



Q 1 




Q2 z 



Figure 3.14. Two query graphs and an instance 
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Again, we can state that our approach based on model-checking tech- 
niques allows to correctly deal with join-free queries of GraphLog without 
closure operators. 

The problem of expressing join conditions by means of CTL formulae 
remains open, because in propositional modal logics framework there is no 
way to force that two distinct states with the same set of local properties are 
really the same state. In fact, the propositional modal logic semantics forces a 
bisimulation equivalence between subgraphs fulfilling the same formula. This 
allows us to say that two nodes with the same properties are equivalent but 
not to require that they are the same node and thus, this is the reason why 
we cannot model correctly the join operation. 

A possibility to solve this problem is to add to the set of local properties of 
each state a sort of identifier of the state: in this way, by supposing the set of 
all states to be finite (actually, this restriction is true in the database context) 
one could express join conditions. For example, the W-query in Fig. 3.15 
requires to find all the Students attending a Course together with at least 
one friend. If we know that the set of identifiers of Course nodes is {ci, C 2 }, 
we could translate the query with the CTL formula 

/ (Student A EX at tends(Course A Ci)A \ 

EXfriend (Student A EX attends (Course A Ci)))V 
(Student A EX at t en ds(Course A C2)A 
\ EXf rienc | (Student A EX atte nds(Course A c 2 ))) / 

However, this translation causes an explosion of the formula size and it 
requires also to know in advance the set of identifiers of a certain state (in 
the example the Course state). 




Figure 3.15. A W-query with a join condition 



3.7.3 G-Log 

G-Log [81] as GraphLog, is a query language in which data, queries, and 
generic rules are expressed by graphs (see Section 2.1). As far as instances 
are concerned, syntax of G-Log instances is the same as that of W. G-Log 
queries allow more flexibility than W-ones. Nevertheless, often G-Log queries 
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have only one “green” node with an edge pointing to a “red” node v . These 
queries are the same as W-queries, in which v is the node pointed by the 
query. In G-Log two kinds of nodes are allowed, complex and atomic nodes, 
but this feature can be simulated by modifying the co-domain of the label 
function in W. Instead, solid edges are forbidden to enter into dashed nodes 
in G-Log. This feature of W is one of the main points for programming 
nesting quantifications and for expressing in a compact way the universal 
quantification. Only existential requirements are instead allowed in G-Log 
by using a unique query and in order to obtain universal requirements one 
should compose a G-Log program (i.e. a combination of rules). If a W query 
fulfills this further requirement, is a G-Log query. 

Semantics of query application is given in [81] via the notion of graph 
embedding and in this work (see also [37]) by using the notion of subgraph 
bisimulation (both NP complete, also without considering negation in query 
graphs). We prove here that all the G-Log queries that belong to the four 
families of graphs defined in Section 3.4.4 can be correctly solved using the 
model-checking approach. We have already discussed (see Section 3.5) that 
for complexity limits we cannot extend the method to all cyclic queries. 

Theorem 3.7.2 (Correctness I). Let I be a W -instance and {G,u) be a 
W-query without negation and s.t. G I psoZicZ} a rec ^ DAG rooted at u. Then, 
a subgraph I' of I matches with G, and m\, . . . ,m z are in relation with v if 
and only if all corresponding nodes m \ , . . . , m' z in /Cj are s.t. ml i |= <F„(G). 

Proof. Assume that /' ~ G|{#, 5 } and that b is a bisimulation. Since G|{#s} 
is rooted by the node v, this node will be in relation b with some nodes 
mi, . . . , m z of V . For each n £ N let depth[n) be the maximum length of a 
path from n to a leaf. We prove the property by induction on depth(n). 

If depth {v) = 0, the property is trivially true. There are z nodes in / that 
satisfy the requirements of the query. They are exactly the nodes in /C/ in 
which the formulae are true. The same situation holds between the formula 
^(G) and a state of /C/. 

Let depth(v) > 0 and assume that the property holds for the subgraphs. 
We are in the situation of Figure 3.4 with k = 0, because G | | t t as l le d } = ® 
and thus, 

tf„(G) = *>(!/) A A EX^fcjOMG)) 

i=l...h 

For one direction, assume I' matches with G, and mi, . . . ,m z are root 
nodes of I' . Of course they are bisimilar to v. By definition of bisimulation, 
this means that tciy) = ^(m^) for all * = 1, ... , z. Thus, the formula p(v), 
namely lc ( v ) , is true in the states m \ , . . . ,m' z of the transition system /Cj. 
Moreover, by definition of bisimulation, it holds that for any edge {v,ei,bi) 
outgoing from v and for each m, there is in I (at least) one edge outgoing 
from with the same label e* and the node reached is bisimilar to b, 
(and vice versa, for each node m,, ... The analysis of this part is analogous). 
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Bisimulation and the acyclicity of G implies acyclicity of I': we can apply the 
inductive hypothesis to the various pairs of subgraphs of G and of I' identified 
in this way (/3\, ■ ■ ■ ,Ph are the subgraphs of rooted by bi, . . . , bk as 

in Figure 3.4). Thus, by inductive hypothesis, we have that there are nodes 
&s such that & \= ^ (G) and that the node (state) £* in /Cj is in relation e* 
with the node m! i corresponding to the node ?rq bisimilar to v. Thus, each 
formula EX^Mi^G)) is fulfilled by to' , and also is the conjunction. The 
proof of the other direction is similar. 

Theorem 3.7.3 (Correctness II). Let (G, v) be a W-queries consisting 
w.l.g. in a simple minimal cycle with a unique node v and a unique edge 
e, and I = (N,E,£) be a W -instance, and /Cj = {Eg, Acta, TZgi^g) the cor- 
responding KTS. Then, a node n £ N is s.t. n \= ipv(G) if and only if there 

is a relation b s.t. n is the root of a subgraph I' of I, G ~ I' , and n ~ v. 

Proof. V’i'(G) = EG^( e )£(^). We suppose that m £ N is such that n\ |= 
ipv(G) and we construct the relation b. m |= ^v(G) iff £{ni) = £{v) (we let 
b{n\,v)) and 3(ni,ei,n2) G E s.t. £( 11 , 2 ) = t{v) and £{ei) = £(e) (because of 
the semantics of the Generally operator) and this happens in the following 
two cases: 

1. n 2 = ni. It is easy to check that b is an isomorphism, and thus a bisim- 
ulation. 

2. n 2 ni. In this case the relation b is computed following the previous 
steps; in fact, n 2 \= ipv{G) and consequently, it has to be true that 
3(n 2 ,e 2 ,n 3 ) € E s.t. l(nf) = £{v) and ^(e 2 ) = 1(e)- Finally, we find a 
node p such that £(p) = £{v) and p = n\. At this point it is easy to check 
that the relation b is a bisimulation. 

The other direction of the theorem is easily proved by considering a bisim- 
ulation between the query and a subgraph I' of the instance I rooted in a 
node n (iv(G) is trivially satisfied by n). 



3.8 Expressive Power of Temporal Logics 

In this chapter we have shown an approach based on model-checking tech- 
niques to solve either graphical queries or some simple SQL-like queries on 
semistructured data. 

In Figure 3.16 we sum up the main results of the proposed method: 

— W-queries with also simple cyclic conditions can be translated into propo- 
sitional temporal logics. In particular, general acyclic accessible queries, 
acyclic solid graphs and acyclic graphs, which become accessible after the 
application of a rewriting procedure (cf. Section 3.4.4), can be translated 
into CTL and thus, can be solved in linear time on the size of the formula 
and the Kripke Transition System related to the instance. Note that the set 
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Figure 3.16. Complexity of queries w.r.t. propositional temporal logics 



of these W-queries is strictly contained in the set of CTL formulae. Con- 
sider for example the formula AG tp expressing a “safety property” (p that 
must hold in all paths and in every state of the model taken into account. 
We can state that it does not express any natural query on a database, 
because with queries we want to find out the set of nodes possessing a 
given property (e.g. <p) and not to verify that all nodes satisfy <p. More- 
over, we have shown that often cyclic queries have not an exact translation 
into CTL and can only be approximated by using temporal formulae of this 
logic. On the contrary, we have shown that in general, cyclic queries can be 
translated into CTL* and thus, the algorithm for solving them is PSPACE 
complete (cf. Section 3.4.5). We recall that only simple cyclic queries have 
a CTL translation. 

— In Section 3.7 we have shown that G-Log queries can be considered in the 
same way as W-queries, moreover, some other languages for semistructured 
data can be partially handled with the model-checking based approach. In 
particular, simple path expressions without join conditions and aggregator 
operators (e.g. the minimun, the maximun, the average) have a natural 
graphical representation into W-queries and thus, have a CTL translation 
as well. In general, propositional temporal logics cannot express join con- 
ditions or computing aggregator functions on a set of states satisfying a 
given property. 

— Join conditions are correctly expressed with first-order logic formulae and 
thus, we fell that only predicative modal logics [49] can deal with this kind 
of properties. 

— We have not formally proved that all cyclic queries admit a CTL* rep- 
resentation, and thus an open problem is to study whether the subset of 
cyclic queries that cannot be translated neither in CTL nor in CTL* is 
empty. 

Note that our technique can also be applied to XML-based languages such 
as Quilt [19] and XPatlr [96]. The input and output of these languages are 
documents usually modeled by trees. Once again, model-checking algorithms 
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can correctly deal with properties expressing conditions to be satisfied on 
paths but without joins, grouping, or aggregation functions of SQL. 




4. Temporal Aspects of Semistructured Data 



In Chapter 3 we proposed a method based on model-checking techniques to 
efficiently solve (graphical) queries on semistructured data. However, it is 
known that model-checking has emerged in the past years as a promising 
and powerful approach to automatic verification of systems [27, 73, 28], and 
is especially used to verify temporal properties, which are usually expressed 
by means of temporal logics formulae, on reactive systems. Thus, as natural 
consequence of the temporal logic nature, in this chapter we want to adapt 
the technique introduced for accessing static information stored in graphical 
databases, to retrieve temporal aspects of semistructured data as well, such 
as, for example, the evolution of properties related to semistructured data. 

Most applications of database technology are temporal in nature and rely 
on temporal databases , which record time-referenced data. In this setting, 
when history is taken into account, queries can ask about the evolution of 
data through time and constraints may restrict the way changes occur [24, 25] . 
It is clear that this kind of queries is useful whenever data are not static, 
they change over time and their evolution needs to be taken into account. As 
for the classical database field, also in the semistructured data context it is 
interesting to take into account evolution of data through time. 

In particular, in this work we consider several dynamic aspects of semistruc- 
tured data representation, based on different uses of the concept of “valid 
time”. We define a graphical model allowing to represent and to retrieve 
temporal information about semistructured data according to the following 
interpretations of time: 

— Valid Time: in this case data change chronologically and a sequence of 
modifications represents the evolution of facts in a modeled reality with 
respect to time. For example, the salary of an employee changes over time. 

— Interaction Time: is the valid time relative to user browsing. When users 
browse through a document (for example a hypermedia representation) 
they choose a particular path in the graph representing semistructured 
information and in this way they define a personalized order between the 
visited objects. In this context we can create site views depending on each 
user’s choices, that can be useful to personalize the data presentation. 

The relevance of Valid Time is widely accepted, since, as in the classical 
database field ([90, 25, 24]), also in the context of semistructured data it 
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is interesting to take into account the dynamic aspects of data, i.e. their 
evolutions through time and eventually through consecutive updates, in order 
to query past states and impose restrictions on how the data change in time 
[ 21 ]. 

The relevance of Interaction Time is evident because, while navigating 
the Web is simple, finding what the user needs in such a vast amount of 
information is seldom easy. Much time is wasted trying to find information 
and recovering from dead-end searches. It is widely recognized [8] that if users 
could access a customized version of the Web, their task would be made a lot 
simpler and less time would be wasted. User modeling research [63, 100, 9] 
has shown that each individual has a standard pattern of search behavior that 
can be used for user characterization. However, most currently available user 
profiling systems are still based on straightforward logging of the documents 
[71] as well as of the URLs and paths [65] visited by the user. 

A first step in this direction consists of tracing users’ navigation history, 
in order to recognize the habits of each user, e.g. hyperlink selections, time 
spent on a particular page or number of accesses to a specific information. 
A second step is to use and elaborate such information, in order to obtain a 
user model not only based on static declarations of interests and stereotypical 
descriptions but on actual interactions [39]. 

The main aim of this chapter is to present a general model for repre- 
senting temporal aspects of semistructured data w.r.t. both interpretations 
of time. Besides accommodating the usual notion of valid time, adapted to 
semistructured information, this model allows also to store interactions of 
users with Web sites and therefore to provide a sound formal definition for 
user search patterns, based on the natural notion of graph path. Moreover, 
we introduce a SQL-like query language, which, in the same way of TQuel, 
extends SQL to take into account the time coordinate [86]. 

More in detail, the structure of the chapter is as follows. In Section 4.1 we 
introduce the main concepts related to temporal databases and in Section 4.2 
we present a graphical data model suitable to represent static and dynamic 
aspects of semistructured data. In Section 4.3 we show possible changes to 
semistructured databases and in Section 4.4 we introduce a very simple SQL- 
like query language. To conclude, in Section 4.5 we show the applicability 
of our graphical temporal model in the pre-processing tasks of Web Usage 
Mining activities, in Section 4.6 we use the SQL-like query language to extract 
relevant information about users’ interaction, while in Section 4.6.1 we show 
a model-checking approach to solve temporal queries. 



4.1 An Introduction to Temporal Databases 

In this section we give a brief introduction to the main concepts related to 
temporal databases [93, 25]. 
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A temporal database is a repository of temporal information, while a tem- 
poral query language is any query language for temporal databases. The main 
issues to be addressed in this field are the following: 

— Choice of temporal domains : points vs. intervals, linear vs. branching, dense 
vs. discrete, bounded vs. unbounded time. 

In the database context the term instant is used for a time point and the 
point-based view is predominant. Moreover, intervals are obtained as pairs 
of points. Instead, in the interval-based view it is common to have desig- 
nators for interval endpoints, and thus moving between both views is very 
easy. There is not a single approach to represent time in a database, and 
different time domains may be considered. First the time domain may or 
may not be infinite in the past and future. Second, time may be considered 
as discrete, dense or continuous. The most standard temporal domains in 
this context are natural numbers (N, <) or integer numbers (Z, <), but 
it is commonly assumed that time is discrete and isomorphic to natural 
numbers [87]. These two temporal domains are “flat” and thus, in order to 
handle multiple time granularities (e.g. days vs. weeks) it is necessary to 
consider multiple interrelated temporal domains. An instant in a “higher- 
level” domain corresponds to a contiguous set of instants in another “lower- 
level” domain. 

— Data expressiveness of a temporal data model used to represent temporal 
databases, that is the set of temporal databases we can express with the 
model and the space necessary to represent a given temporal relation. 

— Properties of query languages: language design must consider the impact 
of the time- varying nature of data on all aspects of the language, includ- 
ing predicates on temporal values, temporal constructors, modification of 
temporal relations, integrity constraints. The main aspects to be consid- 
ered for a temporal query language are query expressiveness, which data 
structures it is capable to deal with, query evaluation, and efficient imple- 
mentation. Two well-known temporal query languages are TQuel [86] and 
TSQL2 [89, 88], 

A database models and records information about a part of reality (typ- 
ically called modeled reality). Aspects of this “reality” are represented in the 
database by using database entities. The facts (i.e. the logical statements 
that can be assigned a truth value) recorded by the database entities are of 
fundamental interest. 

In general, time spans are associated with database entities, however, sev- 
eral different temporal aspects may be associated with them. In particular, 
the valid time of a fact is the time when the fact is true in the modeled reality. 
Valid time thus captures the time- varying states of the modeled reality [56] . 
Note that all facts have a valid time by definition, however, the valid time 
of a fact may not necessarily be recorded in the database for a lot of reasons 
(e.g. it may not be known, or recording it may be irrelevant for the appli- 
cations supported by the database). The transaction time of a database is 
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the time when the fact is current in the database. All database entities have 
a transaction time aspect but this aspect may or may not be stored in the 
database. Note that transaction time may be associated with any database 
entity, not only with facts. For example, transaction time may be associated 
with objects and values that are not facts because they cannot be true or 
false in isolation (e.g. the “Database” name of a course). The transaction 
time aspect of a database entity has a duration: from insertion to deletion, 
with multiple insertions and deletions being possible for the same entity. As 
a consequence, deleting an entity does not physically remove the entity from 
the database; rather, the entity remains in the database, but ceases to be part 
of the database’s current state. Observe that transaction time of a database 
fact F is the valid time of the related fact “F is current in the database” [55]. 

The main aspects that are used to distinguish between valid time and 
transaction time can be summarized as follows [90] : 

— valid time is characterized as the time when an event occurs in reality; 
transaction time is the time when the data concerning the event were 
stored in the database. 

— A transaction-time value may be added to the database, yet once it has 
been added, it may not be changed. In fact, a time value that records 
when the data were stored cannot later be changed. Valid-time values, on 
the other hand, are always subject to change, since differences between the 
history (a sequence of events or time intervals) as it actually occurred and 
the representation of the history as stored in the database will often be 
detected after the fact. Thus, this distinction is between permitting only 
append operations or arbitrary modification operations. 

— Valid time is generally characterized in the literature as being application- 
dependent, while transaction time is considered to be application-inde- 
pendent and thus, it can be automatically computed by the Database 
Management System. An example for understanding this difference is a 
retroactive salary raise, where the time at which the raise was recorded 
(for example, 12/10/2000) is considered application-independent, as it is 
not under the user’s control, whereas the time at which the raise was to 
take effect (for example, 1/10/2000) is considered application-dependent. 

Valid time and transaction time concepts allow also to characterize the 
features associated with a particular kind of DBMS supporting that time 
concept [90] in the following way. 

Conventional databases model the real world, as it changes dynamically, 
by a snapshot at a particular point in time. A state or an instance of a 
database is its current content, which does not necessarily reflect the current 
status of the real world. Updating the state of a database is performed by 
using data manipulation operations such as insertion, deletion or replacement, 
taking effect as soon as it is committed. In this process, past states of the 
database, and those of the real world, are discarded and forgotten completely. 
We call this type of database a static database. In the relational model, a 
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database is a collection of relations. Each relation consists of a set of tuples 
with the same set of attributes , and is usually represented as a 2-dimensional 
table [3]. As changes occur in the real world, changes are made on this table. 
For example, an instance of a relation “Faculty” at a certain moment may 
be: 



Name 


Rank 


John 

Tom 


full professor 
associate professor 



However, there are many situations where this static database relying on 
snapshots is inadequate. For example, it cannot support historical queries 
and thus, answer to the question “What was Tom’s rank 3 years ago?”. It 
cannot also support retroactive changes or postactive changes, which are 
changes that are not true in the real world yet, but it is known that they are 
becoming true in the future. 

One approach to solve the above problems is to store all past states, 
indexed by time, of the static database as it evolves. This approach requires 
a representation of transaction time, i.e. the time when the information was 
stored in the database. A relation under this approach can be illustrated 
conceptually in three dimensions. By fixing a time instant, it is possible to 
get a snapshot of the relation (a static relation) and make queries on it. 
The operation of extracting a snapshot is called rollback , and a database 
supporting it is called static rollback database. 

Changes to this kind of database may only be made to the most recent 
static state. One limitation of supporting transaction time is that the history 
of database activities, rather than the history of the real world, is recorded. 
A tuple becomes valid as soon as it is entered into the database as in a static 
database. There is no way to store retroactive/postactive changes, nor to 
correct errors in past tuples. Errors can sometimes be overridden (if they are 
in the current state) by they cannot be forgotten. 

A common approach to implement a static rollback database is to append 
the start and end points of the transaction time to each tuple, indicating the 
points in time when the tuple was in the database. For example, a “Faculty” 
relation in this approach looks as follows: 



Name 


Rank 


Transaction time 


Start time 


End time 


John 

John 

Tom 


associate professor 
full professor 
associate professor 


1/1/1990 

1/11/1998 

1/10/1999 


1/11/1998 

now 

now 



While static rollback databases record a sequence of static states, histor- 
ical databases record a single historical state for each relation, storing the 
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history as it is best known. When errors are discovered, they are corrected 
by modifying the database. Previous states are not retained, so it is not pos- 
sible to view the database as it was in the past and there is no record stored 
of the errors that have been corrected. Historical databases must represent 
valid time, the time when the stored information model reality. Historical 
databases may also be represented in three dimensions and to implement 
them it is possible to append endpoints of the valid time to each tuple, in- 
dicating the points in time when the tuple accurately modeled reality. An 
example of the “Faculty” relation reported above is: 



Name 


Rank 


Valid time 


Start time 


End time 


John 

John 

Tom 


associate professor 
full professor 
associate professor 


1/10/1989 

1/10/1998 

3/10/1999 


1/10/1998 

now 

now 



Note that the information that John was promoted full professor on 
1/10/1998 was stored in the static rollback database only one month later 
(the start transaction time is 1/11/1998). 

Benefits of both approaches can be combined by supporting both transac- 
tion time and valid time. While a static rollback database views tuples valid 
at some time as of that time, and a historical database always views tuples 
valid at some moment as of now, a temporal DBMS makes it possible to 
view tuples valid at some moment seen as of some other moment, completely 
capturing the history of retroactive/postactive changes. The term temporal 
database indicates the need for both valid and transaction time in handling 
temporal information. A temporal relation may be thought of as a sequence 
of historical states, each of which is a complete historical relation. The roll- 
back operation on a temporal relation selects a particular historical state, 
on which an historical query may be performed. Each transaction causes a 
new historical state to be created, hence, temporal relations are append-only. 
Now the “Faculty” database looks like: 



Name 


Rank 


Valid time 


Transaction time 


Start time 


End time 


Start time 


End time 


John 


associate 


1/10/1989 


now 


1/01/1990 


1/11/1998 


John 


associate 


1/10/1989 


1/10/1998 


1/11/1998 


now 


John 


full 


1/10/1998 


now 


1/11/1998 


now 


Tom 


associate 


3/10/1999 


now 


1/10/1999 


now 


Tom 


full 


3/10/1999 


now 


5/10/1999 


now 



Note that the relation shows that John started working on 1/10/1989 
but the information was entered into the database on 1/1/1990 retroactively. 
Tom was entered into the database on 1/10/1999 as joining the faculty as 
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an associate professor on 3/10/1999. The fact that he was actually a full 
professor was noted on 5/10/1999. 

Another last notion of time, called user-defined time [59] can be used when 
additional temporal information, not handled by transaction or valid time, is 
stored in the database. For example, for the “Faculty” relation one may be 
interested in the “promotion date” , that is the date shown on the promotion 
letter. In this case the valid date is the date when the promotion letter was 
signed, i.e. the date when the promotion was validated, and the transaction 
date is the date when the information concerning the promotion was stored 
into the database. Note that this notion of time is application-dependent. 

These concepts of classical temporal databases can be applied also to 
semistructured databases, which are usually represented by means of labeled 
graphs instead of flat tables (see [52] for a survey). Thus, the first step for 
studying temporal aspects of semistructured data is to extend graph based 
data models in order to keep trace of historical information. 



4.2 A Graphical Temporal Data Model for 
Semistructured Data 

In this section we introduce a data model based on labeled graphs, useful to 
represent static and dynamic aspects related to semistructured data. We do 
not focus on a particular existing static model because our choice is general 
enough to apply to several approaches presented in the literature [33, 81, 66]. 

In order to represent dynamic aspects of semistructured databases and to 
allow querying about their evolution through time, we add the information 
about the temporal element of objects and relations to their graph. This kind 
of information represents the validity of objects and relations, i.e. the time 
when the objects or relationships are true in the modeled reality [30, 57]. 
We need to consider temporal element instead of simple intervals, in order 
to support “reincarnation” since the death of an object is not necessarily 
terminal. In this way, we are able to derive, for example, if a particular 
information is currently true, or if it was true in a previous state, if it has 
been deleted or updated. 

This approach allows us to store the minimal amount of information, in 
fact we avoid duplicating a lot of nodes and edges (cf. Sect. 5.3.1 of Chap- 
ter 5). 

Definition 4.2.1. A semistructured temporal graph is a directed labeled 
rooted graph (N, E,r,£) , where N is a (finite) set of nodes, E is a set 
of labeled edges of the form (m, label, n), with m,n € N, label € ( T e x 
(C e U {-L}) x (V U {_L})) , r £ N is the root of the graph , and £ : N — * 
(7/U{_L}) x (£„U{_L}) x (5U{1}) x (VU{1|). T means ‘undefined’, and: 

— T n = { complex, atomic } is a set of types for nodes; 

— T e = { relational , temporal } is a set of types for edges; 
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— C n and C e are sets of labels to be used respectively as names for complex or 
atomic objects (nodes), and relations (edges). We require that C n C\C e = 0; 

— S is a set of strings to be used as atomic values; 

— V is a set of temporal elements such as [ti , ^ 2 ) U [£ 3 , £ 4 ) U . . . U [t n _i,£ n ), 
with n > 1, to be used as time intervals for nodes and edges. We use a 
temporal element to keep trace of different time intervals when an object 
exists in the database. 

£ can be seen as composed by four single-valued functions £t„, £c n ,£sffv- 
With abuse of notation, when the context is clear, we will use £ for nodes and 
edges: if e = ( m , (£, k, v), n }, then /r e (e) = £, £c e ( e ) = an d ^v( e ) = v . It is 
also required that: 

— (Vx £ N)(£j- n (x) = complex — * £s(x) = _L) (i.e. values are associated to 
atomic objects only); 

— ( \/x £ N)(£-r n (x) = atomic — » £v(x) = _L) (i.e. temporal elements are not 
associated to atomic objects because in the database the temporal element 
of a value coincides with the temporal element of its unique ingoing edge); 

— (Ve = (m, label, n) € E)(£q- n (m) = complex) (i.e. atomic nodes are leaves); 

— (Ve = (m, (relational, label), n) € E)(£-r n (n) = atomic — > label = ‘‘Has Pro- 
perty'’) (i.e. each atomic node is connected to its parent by an edge labeled 
“Has Property” ) . 

— (Vx £ E)(£j- e (x) = temporal — » £\>(x) = _L) (i.e. temporal elements are 
associated to relational edges only); 

— (Vx £ E)(£-r e (x) = temporal —> £c e ( x ) = “Temporal”) (i.e. the label “Tem- 
poral” is associated to each temporal edge/ 

Note that two nodes may be connected by more than one edge, provided 
that edge labels be different. 

In this work we assume that each node of a semistructured temporal graph 
is unambiguously characterized by an identifier. Moreover, the label of the 
(unique) root represents the name of the database. 

For the sake of readability, in the examples we usually omit the type label of 
nodes and edges. 

Temporal edges are represented by jagged lines. As we introduced in Def- 
inition 4.2.1 we distinguish two kinds of nodes: complex objects, graphically 
represented by rectangles, and atomic objects, depicted as ovals. Complex 
objects are related to other complex objects and possibly “possess” a num- 
ber of attributes (atomic objects), whereas atomic objects represent objects 
with an atomic value (i.e. a string, an integer, but also a text, an image, a 
sound) and do not exist independently of their parent complex object. For 
this reason we choose to replicate atomic objects with the same atomic value 
but connected to different complex objects, although they represent the same 
printable information. This is also the reason why we do not associate an in- 
dependent temporal element to atomic objects. For each complex object it is 
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possible to declare a key , that is a subset of its properties (attributes) that 
identifies in a unambiguous way the object itself. 




Databases Web Databases Software Engineering 



Figure 4.1. A semistructured temporal graph reporting faculty data 



For example, in Figure 4.1 we show a semistructured temporal graph, 
called Department (note that Department is the label of the root node), 
representing information about some teachers and related courses. 

As explained above, we associate a temporal element / = \t\, £ 2 ) U . . . U 
[f n _i,£ n ), with n > 1, to each relational edge and complex object; for each 
time interval [£*,£$+ 1 ) in /, with i £ {1, . . . ,n — 1}, the instant U represents 
the start time (i.e. the “birth” of the object), while U+i, called end time, 
represents the time instant when that piece of information (node or edge) 
ceased to be true in the reality (i.e. the “death” of the object). In the following 
sections, when an entity is currently true, i.e. it is valid, t n assumes the 
constant value “now”. This value can be seen as 00 and it does not cause 
difficulties when it is used as end time, although the time interval is open to 
the right. 
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In the examples we propose later in this work, time granularity depends on 
the application: it may be one day, or one year, etc. However, our approach 
will encode instants of time in a discrete fashion. 



4.3 Operations on Temporal Data 

In this section we describe a set of possible operations that could affect a 
semistructured database and how its semistructured temporal graph is con- 
sequently modified. 

Let G = ( N , E, r, £) be a semistructured temporal graph. We consider the 
following change operations: 

1. Insert the root node 

obj r = insert-root-node(complex,l, [U,tj)) inserts the root node with la- 
bels £ T n (obj r ) = complex , £c n (obj r ) = l, and gives as result the object 
identifier obj r of the root node itself. 

2. Insert a complex object 

objk = insert-complex-node(complex,l, P, [U,tj)) inserts a node with la- 
bels £T n {objk) = complex , £c n (objk) = l, set of attributes P, and gives 
as result the object identifier objk of the node itself. Note that P is a 
set of couples (Label, Value). The valid time of the inserted object is 
[ti,tj). If tj is the constant “now” then the object is current, otherwise 
it is historic. This operation inserts also the set of atomic objects P 
(which represent the properties of the object objk) by calling the opera- 
tion insert- atomic-node for all the pairs (Label, Value) € P, as described 
below. 

In Figure 4.2 we show the insertion of the current complex object 
“Course” and the creation of its atomic object with label “Title” and 
value “Databases”. The valid time of the inserted objects is [01/01/1999, 
now). Note that the valid time of the atomic object is reported on the 
edge labeled “Has Property” between “Course” and “Title”. The insert 
operation is: 

insert-complex-node(complex, Course, {(Title, Databases)}, [01/01/1999, 
now)) 

and returns as result the object identifier of the created node obj 3 . 

After this operation the course node is not reachable from the root and 
thus cannot be retrieved by using TSS-QL. However the node itself can 
be used by other operations and in particular can be connected with a 
path to the root of the graph. 

3. Insert a relationship 

insert-relationship(objk, label, objt, [ti,tj)) adds an edge labeled label be- 
tween objk and objt.. This operation does not have constraints and does 
not make sense between a complex and an atomic object because proper- 
ties, represented by means of atomic objects, are part of complex objects. 
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Figure 4.2. A new course has been added 



We suppose the professor “Mary Jones” starts teaching the course 
“Databases” at time 01/02/1999. In Figure 4.3 we show the new relation- 
ship “Teaches” with valid time [01/02/1999, now), between the complex 
objects “Professor” and “Course”, that we obtain with the operation: 

insert-relationship)obj\, Teaches , objz, [10/02/1999, now)) 

4. Insert an atomic object 

objt = insert-atomic-node(obj p , atomic, l, s,[ti,tj)) creates a child node 
of obj p with labels (. T n i,°bjt ) = atomic, ic n {°bjt) = l, Zs{°bjt) = s and 
£v(objt) = _L. The operation returns as result the object identifier objt- 
Moreover, this operation adds also an edge from the parent complex 
object objp to the atomic object itself, with label “Has Property” and 
temporal element [U, tj). Note that in the valid time context the insertion 
of an atomic object is used to insert new values but also to record an 
update to a value. Indeed, any change which involves a property should 
still keep information about its previous value by adding a temporal edge 
(see the evolution of the “Title” property for the “Databases” Course in 
Fig. 4.1). Note that an atomic object is always connected to a complex 
object, because it represents a property of the complex object, and thus 
the temporal element of its ingoing edge starts after the start time of its 
parent, i.e. the following constraint must hold: t\ < U and tj < t 2 , where 
tv{objp) = [fi,f 2 ). 
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Figure 4.3. A relationship has been inserted 



In Figure 4.4 we show the insertion of the current atomic object “Phone”, 
with value “1234” and temporal element [05/01/1999, now), which is re- 
ported on the edge labeled “Has Property” between “PersData” and 
“Phone”. Supposing that PersData has as object identifers obj$, then 
the insert operation is: 

insert-atomic-node(obj 5 , atomic, Phone, 1234, [05/01/1999, now)) 

and returns as result obje . 

5. Remove an object 

remove-node)objk) removes the node with identifier objk (if it exists). 
Note that if the object is complex we also remove its properties (i.e. 
its atomic objects) and also all the relationships with other complex 
objects. After this operation some portions of the original graph can be 
unreachable from the root node. 

6. Remove a relationship (between complex objects) 
remove-relationship(objk, label, objt) removes an existing edge. 

Note that it is possible to remove a relationship between valid objects. In 
fact, if we remove an edge (objk, label, objt) we do not delete the object 
objt, although it is not reachable, because it is a valid object, i.e. it is 
currently true in the reality. Also after this operation some portions of 
the original graph can be unreachable from the root node. 
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Figure 4.4. A new atomic object has been added 



7. Modify the temporal element of an object 

modify- object-interval(objk , OldTemporal Element , NewTemporalEle — 
merit) checks the valitidy of the constraint imposing that the temporal 
element of atomic objects must be included in the temporal element 
of their parent. If this constraint is satisfied the operation modifies the 
temporal element of the node. 

In Figure 4.5 we modify the temporal element of the complex object 
“Course” and also of its atomic object “Title” (with value “Software 
Engineering”). To modify the temporal element of an atomic object, we 
change the temporal element of its “Has Property” edge. The operation 
for the “Course” object is: 

modify-object-interval(obj 2 , [01/01/1997, now), [01/01/1997,01/11/2000)) 

8. Modify the temporal element of a relation 

modify-relation-interval(objk, label , objt , OldTemporal Element , New — 
Temporal Element) consists of modifying the temporal element of a re- 
lation. 

We suppose the professor “Mary Jones” stops teaching “Software En- 
gineering” at time 01/11/2000. In Figure 4.6 we modify the tempo- 
ral element of the relationship “Teaches”, between “Professor” (“Mary 
Jones”) and “Course” (“Software Engineering”), from the temporal ele- 
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Figure 4.5. A temporal element of an object has been modified 



ment [01/01/1997, now) to the new one [01/01/1997,01/11/2000). The 
operation is: 

modify-relation-interval{obj\ 1 Teaches, 06 / 2 , [01/01/1997, now), 
[01/01/1997,01/11/2000)) 

9. Update a value 

update(objk , old-value , new-value) is used to correct a value: we change in 
the graph the label ic n (°bjk) from old-value to new-value. 

In Figure 4.7 we show the database after the update of the value 
“Databases” which is modified into “Web Databases”. Supposing the 
object identifier of this property is objr, the operation is used for this 
update is: 

update(obji . Databases , WebDatabases) 

The new value is valid from 01/01/2000, thus the operation modify- object- 
interval is also needed. 



4.4 TSS-QL: Temporal Semistructured Query Language 

In this section we present the SQL-like query language TSS-QL [77], syn- 
tactically specified in Section 4.4.1, aimed at querying semistructured data 
modeled by means of semistructured temporal graphs (see Section 4.2). The 
semantics will be informally specified only by examples. 
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Figure 4.7. A value has been updated 
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We use the concept of path expression, which identifies a path on the 
information graph. The path expression has to start and finish with a node 
label and may contain wildcards, which allow the user to access information 
also when their structure is not known. 

The SELECT clause specifies the form of the result of a query by means 
of a list of path expressions. In particular the result is a set of paths in the 
instance graph. 

A path expression may contain different wildcard characters, whose mean- 
ing is the following: 

represents any sequence (eo, Oi, e \, . . . , o n , e n ) with n > 0. Note that the 
length of the path is the number of nodes in the path, and thus, if n = 0 
the sequence is composed by one edge. 

“?” represents a sequence (eo) or (eo, oi, ei). In the former case the wildcard 
is used to represent, between two nodes, a relationship whose name we do 
not know, whereas, in the latter case, it is used to represent an unknown 
path composed by edge-node-edge. 

The FROM clause contains a list of database names (actually, a list of 
root node labels of semistructured temporal graphs) and specifies the name 
of the databases (i.e. their temporal graphs) to be considered as information 
sources. 

The WHERE clause specifies some conditions (possibly linked by con- 
nectives such as AND, OR) on objects or sub-objects matching the SELECT 
clause. It is thus possible to define restrictions on the values of atomic ob- 
jects or on the temporal elements of nodes and edges. Specifically, temporal 
constructs are introduced for referencing the temporal elements of nodes and 
edges (— ► INTERVAL), for testing a particular condition on temporal ele- 
ment w.r.t a time instant (begin, end, begin before, ...), and for the comparison 
of different temporal elements ( precede , overlap, equal). Relations between 
temporal elements are the classical ones recognized and defined in [62]. 

In order to allow the user to specify, during the query composition process, 
the exact structure of the data to retrieve, we include the EXIST operator. In 
a TSS-QL query this operator is followed by the path specifying the structure 
in the graph related to a database of the information of interest. 

4.4.1 Grammar of TSS-QL 

In this section we introduce the grammar for the language TSS-QL by propos- 
ing different SQL-like fragments; in this way, we gradually increase query 
expressiveness. We plan to add new temporal operators to the language and 
the possibility to express updates as well. 

The first fragment, i.e. the “Path Fragment”, is called P-Fragment and 
shown in Figure 4.8. We have called this portion of TSS-QL the “Path Frag- 
ment” because of the nature of the information we can extract (paths in 
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<query>::= SELECT [<label> AS] <select list> 

[ FROM <from list> ] 

[ WHERE [<label> AS] <condition> ] 

[ EXCEPT < query > ] 

<select list>::= <select path>, <select list>| 

< select path> 

<select path>::= <node path>|<attribute t-list path> 

<node path>::= <node label>|<edge path>.<node label> 

<node label>::= <object label>|<attribute label> 

<edge path>::= <object label>.<edge label>| <object label>.<wildcard>| 
<object label>.<edge label>.<edge path>| 

<object lab el >.< wildcard >.< edge path> 

<attribute t-list path> ::= <attribute t-list > | <edge path>.<attribute t-list> 

<attribute t-list> ::= <attribute label>. Temporal. <attribute label>| 
<attribute label> # <attribute label>| 

<attribute label> # <attribute t-list> 

<wildcard>::= * | ? 

<from list>::= <DB label>, <from list > | <DB label> 

<condition>::= <predicate> 

<condition> OR <condition>| 

<condition> AND <condition>| 

<predicate>::=EXISTS <node path>| 

<attribute path> <op> <value> 

<node path> IN <query> 

<attribute path>::=<attribute label>| 

<object path>.<edge label>.< attribute path>| 

<object path>.<wildcard>.<attribute path>| 

<op>::= = | =! = 

The intuitive meaning of the remaining nonterminals is the following: 

<label> a string 
<DB label> a database name 
<object label> a complex node name 
<attribute label> an atomic node name 
<edge label> an edge label 
<value> a constant string 



Figure 4.8. The P-Fragment of TSS-QL 
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the semistructured graph including simple conditions on atemporal as well 
as temporal information. In particular, equality of temporal elements can be 
checked, and temporal edges can be followed in order to investigate on ob- 
jects history). The semantics of this fragment is specified by translating its 
queries into formulae of the logic CTL (see Section 4.6.1). 

With the “Join-Temporal Fragment”, called JT-Fragment, the expres- 
sive power of TSS-QL is augmented in two directions: a) it introduces the 
possibility of explicitly including complex temporal operators in queries, by 
means of complex conditions on the temporal elements of objects and re- 
lationships, for example, conditions to determine whether two temporal el- 
ements are disjoint, partially overlap, are one included in the other, etc. 
(see [62] for the semantics of these conditions), b) it allows comparisons 
between values belonging to different paths. For this last aspect, with the 
JT-Fragment we can express properties that force, for example, the equality 
between values in two different paths of the semistructured temporal graph 
representing the information of interest. This possibility reminds the “join 
operator” of SQL in the classical relational context. 

Note that, however, the “Path Fragment” is already rather expressive. 
Indeed, although explicit comparisons between node values cannot be per- 
formed, many of the joins that would be required if the data were represented 
in the classical relational way are directly expressed by path traversal. 

The JT-Fragment is reported in Figures 4.9 and 4.10. Note that the 
temporal operators, here, include all kinds of comparison. 

With the CA-Fragment, which is reported in Figures 4.20 and 4.21, we 
add the possibility to use SQL standard aggregate operators, such as, for 
example, MIN, MAX, and COUNT. 



4.4.2 Some Examples of TSS-QL Queries 

In this section we give a flavour of the semantics of TSS-QL by showing some 
intuitive examples of queries on a database called Faculty; Fig. 4.1 reports 
a part of its semistructured temporal graph. Note that here time granular- 
ity is “one day” . For each query we report the SQL-like form based on the 
grammar described in Section 4.4.1, and a possible correspondent graphi- 
cal representation that reflects the main features of G-Log (cf. Chapter 2). 
We investigate on the possibility to translate TSS-QL queries into graphical 
queries for applying all the results we have shown in this work for graph-based 
languages. 

In this work we do not give a precise grammar for the graphical version of 
our language, but it is quite intuitive to associate a meaning to the graphical 
counterpart of each specific query. A work which addresses the problem of 
comparing and translating graphs and textual queries is [40]. 
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<query>::= SELECT [<label> AS] <select list> 

[ FROM <from list> ] 

[ WHERE [<label> AS] <condition> ] 

[ EXCEPT < query > ] 

<select list > : : = <select path>, <select list>|<select path> 

<select path>::= <node path>| 

<attribute t-list path> 

<interval path>— > INTERVAL 

<node path>::= <node label>|<edge path>.<node label> 

<node label>::= <object label>|<attribute label> 

<edge path>::= <object label>.<edge label>| <object label>.<wildcard>| 
<object label>.<edge label>.<edge path>| 

<object label>.<wildcard>.<edge path> 

<interval path>::= <object label>|<edge label>|<edge path>] 

<edge path>.<node label> 

<object path>::= <object label>|<edge path>.<object label> 

<attribute t-list path> ::= <attribute t-list >| <edge path>.<attribute t-list> 

<attribute t-list> ::= <attribute label>. Temporal. <attribute label>| 
<attribute label> # <attribute label>| 

<attribute label> # <attribute t-list> 

<wildcard>::= * | ? 

<from list>::= <DB label>, <from list > | <DB label> 



Figure 4.9. The SELECT-FROM clauses of the JT-Fragment 



More in detail, we represent complex objects with thin line rectangles and 
atomic objects with thin line ovals. Each object has a label composed by its 
name and the time interval (note that we represent a generic interval with 
_). The part of the graph depicted with thin lines represents the conditions 
specified in the WHERE clause (i.e. the structure of required information). 
A bold square links the result of the query (i.e. the object specified in the 
SELECT clause). 

Query 1. Find the names of the professors who were working (i.e. teaching 
at least one course) during the whole year 1999. (supposing that the “Name” 
property is a sub-object of a node called “PersData”). 
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<condition>::= <predicate> 

<condition> OR <condition> 

<condition> AND <condition>| 

<predicate>::=EXISTS <node path>| 

<attribute path> <op> <value> 

<node path> IN <query> 

<attribute path> <op> <attribute path> 

<interval path>^ INTERVAL <temporal condition> 

<attribute path>:: = <attribute label>| 

<object path>.<edge label>.<attribute path>| 
<object path>.<wildcard>.<attribute path>| 

<temporal path expression>::= 

<edge label>.<object label> — ♦ INTERVAL 
<edge label> -> INTERVAL 

<temporal condition> ::= <temporal op> <value>| 

<interval opXselect path>.<temporal path expression>| 

< interval op >< interval values > 

<temporal condition> AND<temporal condition>| 

<temporal condition> OR<temporal condition> 

<op>::= = | >|<|>= | =<| =! = 

<temporal op>::= begin | end | begin before | end before | 
begin after | end after 

<interval op>::= precede | overlap | equal 

The intuitive meaning of the remaining nonterminals is the following: 
<label> a string 
<DB label> a database name 
<object label> a complex node name 
<attribute label> an atomic node name 
<edge label> an edge label 
<value> a constant string 
<interval values> a constant interval 



Figure 4.10. The WHERE clause of the JT-Fragment 



SELECT Professor. HasProperty.PersData.HasPropery. Name 
FROM Department 

WHERE EXISTS Professor. Teaches. Course 
AND Professor.Teaches — > INTERVAL 
(begin “01/01/1999” 

OR begin before “01/01/1999”) 

AND end after “01/01/1999” 
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This query requires to find in the Department database the name of those 
Professor objects connected to at least a Course object by means of an edge 
labeled Teaches, such that its temporal element is composed also by an inter- 
val which contains the year “1999”. The EXISTS clause is needed because 
this is a semistructured database, thus in principle no constraint imposes that 
the Teaches edge should end with the node of type Course. 

In the TSS-QL query these requirements are specified through paths, 
which are graphically depicted in Fig. 4.11. Note that the time intervals of 
the reported objects are generic, and thus replaced with the symbol _, while 
the interval of the edge Teaches is specified as described in the WHERE 
clause. 




Figure 4.11. Example of the visual query Find the names of professors who were 
working in 1999 



The result of this query applied to the database depicted of Fig. 4.1 is 
Professor Mary Jones. Note that the query finds the names of those professors 
whose data are structured as the query dictates, i.e., the professors with a 
PersData sub-object, as specified in the SELECT clause. As we said before, 
semistructured data are irregular: some data are missing, as in the case of 
professor John Smith who does not have a Course sub-object, thus does 
not even satisfy the EXISTS clause. It may also happen that a professor 
does not have a PersData sub-object (see Richard Black). Similar concepts 
be represented by using different types (personal data of professor Richard 
Black are structured in different way from personal data of professor Mary 
Jones). In this context flexible path expressions, which allow querying without 
knowing the structure in advance, are needed, thus path expressions in TSS- 
QL are built from labels and wildcards as seen in Section 4.4. The same query, 
composed by using wildcards, is the following: 



SELECT Professor*. Name 
FROM Department 

WHERE EXISTS Professor. Teaches. Course 
AND Professor.Teaches — > INTERVAL 
(begin “01/01/1999” 

OR begin before “01/01/1999”) 
AND end after “01/01/1999” 
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The graphical version of this query is shown in Fig. 4.12. Note that this 
query allows us to find the names of professors who were working in 1999, 
independently of the structure of the personal data. 



<Professor, _ > 




| cTeach es, [begin <= 01/01/1999, end > 01/01/1999)> 
<Course, _ > 



Figure 4.12. Example of the visual query Find the names of professors who were 
working in 1999 



Query 2. Find the names of professors who were teaching the “Databases” 
course during the whole year 1999. 



SELECT 

FROM 

WHERE 

AND 



Professor. HasData.PerData.HasProperty.Name 
Department 

Professor. Teaches. Course. HasProperty.Name = “Databases” 
Professor. Teaches — > INTERVAL 
(begin “01/01/1999” 

OR begin before “01/01/1999”) 

AND end after “01/01/1999” 



\ 




Databases 



Figure 4.13. Example of visual query Find the names of professors who were 
teaching the “Databases” course in 1999 



This query is similar to the previous one, but a restriction on the value 
of the course name is imposed. In this case we do not have the EXISTS 
condition in the WHERE clause because we want to find a precise Course 
object. The graphical version is in Fig. 4.13, where the name of the Course 
object is specified below the atomic object Name. 
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Query 3. Find the current name of the courses which was called “Data- 



bases” at some past time. 




SELECT 


Course. HasProperty.Title 




FROM 


Faculty 




WHERE 


Course. HasProperty.Title — ♦ 


INTERVAL end “now” 


AND 


Course IN 




SELECT 


Course 




FROM 


Faculty 




WHERE 


Course. HasProperty.Title = 


“Databases” 


AND 


Course. HasProperty.Title — ♦ 


INTERVAL end before “now” 



In the WHERE clause we specify that the interval of the Title property 
of the course of interest has to end with the constant “now” , in this way we 
assure that only the current name is retrieved. 




Data Bases 

Figure 4.14. Example of query Find the current name of the Courses which was 
at some time called “Databases” 



Note that in order to find all the names ever assumed by the course called 
“Databases” we have to write the following query: 

SELECT Course. HasProperty.Title 
FROM Faculty 
WHERE Course IN 
SELECT Course. HasProperty.Title 
FROM Faculty 

WHERE Course. HasProperty.Title = “Databases” 



4.5 A Graphical Model for User Navigation History 

In this section we move from the classical representation of Valid Time to that 
of Interaction Time. When users browse through a document (for example 
a hypermedia representation) they choose a particular path in the graph 
representing semistructured information and in this way they define their 
order among the visited objects. Following the research lines of Web Usage 
Mining field, we can create a view depending on each user’s choices, that can 
be useful to personalize data presentation. 
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A first step in this direction consists of tracing users’ navigation history, 
in order to recognize the habits of each user, e.g. hyperlink selections, time 
spent on a particular page or number of accesses to a specific information. 
A second step is using and processing such information, in order to obtain a 
user model not only based on static declarations of interests and stereotypical 
descriptions but on actual interactions. 

Results on such pre-processing activities about historical analysis can be 
successively used for improving and customizing site content presentation: in 
fact the graph-based temporal representation is used not only to keep track 
of user’s preferences, but also to pose all sorts of different queries by means of 
TSS-QL [77]. We have just seen that the language has a graphical version that 
allows us to reduce the problem of solving a query on a graph-based model 
to the problem of performing a matching of the “query graph” with the “site 
graph” as we explained in Chapter 2. Again, as a consequence of the time 
complexity of this problem, our idea is to find the fragment of our language 
that can be translated into the logic CTL in order to apply model-checking 
algorithms to solve (part of) TSS-QL queries. 

4.5.1 Analyzing User History Navigation 

In this section we formalize our approach to keep trace of users’ navigation 
history. 

Given a Web site modeled by a semistructured temporal graph G as de- 
fined in Definition 4.2.1, in a specific time instant t a user U can interact 
with the site whose information is represented by the snapshot of G at time 
t, called S t (G ); in fact S t .(G) stores all the objects and relations of G which 
are current at the time instant t. 

Definition 4.5.1. Let G = ( N,E,r ,£ ) be a semistructured temporal graph, 
a path in G is a non-empty sequence p = (no, ni, . . .) of nodes s.t.: for all i, 
Hi £ N and (n,,nj+i) £ E. 

Now we define how to keep trace of the activities a user takes with his/her 
mouse and keyboard while visiting a Web site during a single session. 

In the sequel, we shall assume a unique identifier to be associated to each 
user; in order to justify this assumption, we observe that several technologies 
are available to this effect, including reading the IP address of the user’s 
machine or the content of a special purpose cookie ID managed by the Web 
server. 

A user session is the period of time when a user started and ended ac- 
cessing a Web site. We can assume that a user session is terminated when a 
user is inactive for more than 30 minutes. 

For simplicity, we can also assume that a site does not change during a 
session, in this way in a specific time instant t a user U can interact with the 
site whose information is represented only by the snapshot of G at time t, 
i.e. S t (G). 
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Definition 4.5.2. The interaction of a user U in a session S (at time t) 
with a Web site represented by a semistructured temporal graph G is a pair 
(p, Givelnfojj ) where: 

1. p = (n\, ri 2 , . • . , n m ) is a path in the snapshot St(G); 

2. Givelnfojj is a labeling function which assigns to each node n in the path 
p the name of the current session and the interval time the user spends 
visiting n (called “thinking time interval”). For all i, GiveInfou(ni) = 
(S,tui) where S is the current session and tu t is the “thinking time tem- 
poral element ” (actually it is composed by a unique interval) of user U 
on node Ui . 

Remark 4-5. 1. Note that in order to avoid duplication of information, we 
can associate, without loss of generality, the labels related to the name of 
the session and the name of the user only to the first visited node (i.e. the 
root of the site graphical representation) . We recall here that a Web site is a 
semistructured temporal graph and thus, it is a rooted graph. 

Givelnfojj can be seen as composed by two single-valued functions, namely 
GivelnfoSjj and GivelnfoTjj, which return respectively the current session 
and the thinking time interval of each node n in a path p. 

Definition 4.5.3. The global interaction of a user U with a Web site rep- 
resented by a semistructured temporal graph G is a set of interactions (as 
introduced in Definition 4.5.2). 

In Figure 4.15 we show a graph-based representation of a bookshop site. 
In this example we suppose that each complex object (drawn as a rectangle) 
represents a Web page, while atomic objects (drawn as ovals) are the infor- 
mation contained in the site pages. Edges between complex objects represent 
navigational links; for readability reasons we do not report their labels, which 
are composed by the name “Link” and the time interval. We omit also the 
edge labels between complex and atomic objects composed by the name “Has 
Property” and the time interval. 

In Figure 4.16 we represent three interactions of a user who visited the 
Web Site represented in Figure 4.15. For example, in session A the user 
starts from the BookSlrop Page, spending 5 minutes on that page, and then 
he/she clicks on the Index Page of Subjects, after 2 minutes the user clicks 
on the page of the “Fantasy” subject, after 3 minutes he/slre visits the page 
of the book titled “Harry Potter and the Prisoner of Azkaban” and so on. 
For readability reasons in Figure 4.16 and Figure 4.17 we represent only 
complex objects because they correspond to visited pages (attributes are 
only information contained in these pages). 

In order to build a history-based user profile, we now analyze the infor- 
mation stored in the “global interaction” of each user. 

We introduce a definition to obtain a graph from a global interaction A 
of a user with a Web site. Intuitively, we construct a labeled graph Ga which 
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Mega Book Store 




<Subject, 5 
[01/01/1988, now)> 

l i 



C^Name^) 

Fantasy 



[tie - 

A time to kill $ 20.55 



C l Title J >* 

A painted house 

d l Priced ) 

$ 27.95 



<Book, [06/02/2001, now)> 



<Book, [01/01/1998. now)> ^ 

d l Priced ) 

$ 6.99 

^ ^d l Title J C 

<Book, [01/01/1999, now)> ^ Harry Potter and the 

1 Pnsoner ol Azkaban 

d l Price d) 

14 ] Harry Potter and the 

— — Goblet of Fire 

Priced ^ 



<Book, [01/01/2000, now)> 



Figure 4.15. A semistructured temporal graph for a bookshop site 



Session A Session B Session C 




Figure 4.16. Global interaction composed by three interactions of a user in sessions 
called A, B, and C 
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has as nodes the set of visited nodes (i.e. all the nodes in A) and as edges 
the set of all node pairs (a, b) such that in at least a path of A the edge (a, b) 
exists. For each node n of Ga a labeling function GiveTotallnfo gives the 
total thinking time, that is the temporal element composed by the thinking 
times GiveInfojj(n) in all the paths where n is present. 

Definition 4.5.4. The merge Ga of a global interaction A = {ii, . . . ,ik} 
over a semistructured temporal graph G is obtained as: for all j € {1, . . . , k}, 
let ij be the interaction ij = ((ni, . . . , n/,. }, GivelnfcP) . 

We obtain a semistructured temporal graph Ga = (N, E, GiveTotallnfo) 
where: 

— N = ComplexU Atomic where Complex = {ni, . . . , n/^} U. . .U{ni, . . . nh k } 
(note that U is the set union and thus N does not have repeated el- 
ements) and Atomic = {m \ ir(m) = atomic A 3 n £ Complex A 
(n, has-property,m) is an edge ofG}. 

— (a, b) £ E with a,b £ N iff there is a path in A containing an edge (a, b). 

— For each node n £ N the labeling function GiveTotallnfo gives the temporal 
element related to n, i.e. the union of all the thinking time intervals of n. 

Note that nodes and edges preserve their original labels composed by 
name and type. 

Observe that the graph we obtain by applying the procedure in Defini- 
tion 4.5.4 may be a non-connected graph; for example, it could be the case 
that the user clicks on different nodes in each session. 

The “merge” graph represents the activities (i.e. clicks on the pages of a 
Web site ) that a user takes in a set of sessions. In particular, its set of nodes 
is composed by all complex objects in the global interaction (i.e. the visited 
pages) and their attributes (i.e. atomic objects). For readability reasons in 
Figure 4.17 we represent only complex objects because they correspond to 
visited pages (attributes are only information contained in these pages). 

From this graph it is possible to derive the time spent by the user on 
different pages. This information may be used as an indication of the user 
preferences. Unfortunately, since this graph contains cycles, we can derive 
neither the exact sequence of clicks, i.e. the precise order between different 
clicks, nor the information about the sessions, because we do not represent 
it. This last piece of information could give us an idea of the favorite order 
when navigating a site of interest. 

The merge graph we obtain by combining the three interaction paths 
depicted in Figure 4.16 is shown in Figure 4.17. Note that from the graph 
it is possible to derive for each node the total thinking time (for example, 
the favorite page for this user seems to be the page of “Harry Potter and 
the Goblet of Fire”, which has the highest thinking time). However, it is not 
possible to derive the navigation history of each session because we do not 
keep trace of session names. 
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Figure 4.17. The “Merge” graph of interactions A, B and C 



While in Figure 4.16 we report the global interaction of a particular user, 
in Figure 4.18 we show the global interaction of a group of users. This rep- 
resentation is useful to analyze the behavior of a set of users (or of a unique 
user working in more parallel sessions) in order to classify their habits and 
preferences. 

Also for a group of users we can derive a “merge graph” containing the 
set of visited pages and the total time spent on each page. Note that in this 
case we can have parallel sessions of different users on the same page, i.e. 
thinking time intervals may overlap, and thus, in order to have the real total 
time spent on a page, we do not merge intervals which have a non empty 
intersection. 



4.6 Using the Query Language TSS-QL to Obtain 
Relevance Information 

In order to achieve our goal regarding the customization of WWW sites, we 
have to elaborate the information on actual interactions of users. Our usage 
of the temporal data model for user profiling is based on the observation 
that the time spent by the user on the individual components of a site can 
be safely assumed as an implicit relevance measure. In fact the information 
about each individual component of a Web site can be obtained by means of 
extended TSS-QL queries (see [77, 39] for further details). More in detail, 
it is possible to query the original Web site graph, the interaction paths and 
the Merge graph, because they are all semistructured temporal graphs. 
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Session A: User 1 Session B: User 2 Session C: User 3 




Figure 4.18. Global interaction of a group of users 



Again, we increase the expressive power of TSS-QL by introducing with 
the CA-Fragment, which is shown in Figures 4.20 and 4.21, the possibil- 
ity to express complex queries with aggregation functions (e.g. MIN, MAX, 
COUNT). 

Now we report some examples of queries useful to extract relevant infor- 
mation on the behavior of users: 

1. Find the hook pages reached from John Grisham’s page with a click 
SELECT Author. Book 

FROM Global Interaction 

WHERE Author. Name = “John Grisham” 

2. Find the book pages reached from John Grisham’s page with some clicks 
SELECT Author*. Book 

FROM Global Interaction 

WHERE Author. Name = “John Grisham” 

3. Find the pages reached from John Grisham’s page with a click 
SELECT Author. Dummy 
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Figure 4.19. The “Merge” graph of a group of users 



FROM Global Interaction 

WHERE Author. Name = “John Grisham” 

4. Find the users who visited John Grisham’s page as last page 
SELECT BookShop— > UserlD 
FROM Global Interaction 
WHERE Author. Name = “John Grisham” 

EXCEPT 

SELECT Author—* UserlD 
FROM Global Interaction 

WHERE Author. Name = “John Grisham” AND 
EXISTS Author. Dummy- 

Note that the word “Dummy” is a placeholder for any node label and the 
field “UserlD” is used to extract the label related to the name of the user 
from the first visited node (cf. Remark 4.5.1). 

4.6.1 Semistructured Temporal Graph as a KTS 

In this section we show how to build the KTS associated with a semistruc- 
tured temporal graph (see [46]), in order to apply the technique based on 
model-checking algorithms to efficiently extract relevant information about 
user navigation history. 
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<query>::= SELECT [<label> AS] <select list> 

[ FROM <from list> ] 

[ WHERE [<label> AS] <condition> ] 

[ GROUP BY <attribute path expression> ] 

[ EXCEPT < query > ] 

<select list>::= <select path>, <select list>|<select path> 

<select path>::= <node path>| 

< attribute t-list> 

<interval path>^INTERVAL 
<aggregation path expression> 

<node path>::= <node label>|<edge path>.<node label> 

<node label>::= <object label>|<attribute label> 

<edge path>::= <object label>.<edge label>| <object lab el >.< wildcard >| 
<object label>.<edge label>.<edge path>| 

<object label>.<wildcard>.<edge path> 

<interval path>::= <object label>|<edge label>|<edge path>| 

<edge path>.<node label> 

<object path>::= <object label>|<edge path>.<object label> 

<attribute t-list path> ::= <attribute t-list>| 

<edge path>.<attribute t-list> 

<attribute t-list> ::= <attribute label>. Temporal. <attribute label>| 
<attribute label> # <attribute label>| 

<attribute label> # <attribute t-list> 

<wildcard>::= * | ? 

<from list > : : = <DB label>, <from list > | <DB label> 

<aggregation path expression> ::= <aggregation operator> 

(<simple path expression>) 

<simple path expression> ::= <node label>| 

<object label>.<edge label>.<simple path expression>| 

<object label>.<wildcard>.<simple path expression>| 

<label> 

<attribute path expression> ::= <attribute label>| 

<object label>.<edge label>.< select path> 

<aggregation operator>::= COUNT | MAX | MIN | AVG 

<from list > : := <DB label>, <from list > | 

<DB label> 



Figure 4.20. The SELECT-FROM clauses of the CA-Fragment 
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<condition>::= <predicate> 

<condition> OR <condition>| 

<condition> AND <condition>| 

<predicate>::=EXISTS <node path>| 

<attribute path> <op> <value> 

<node path > IN <query> 

<attribute path> <op> <attribute path>| 

<interval path> — > INTERVAL <temporal condition> 
<label> = ANY <query > 

<attribute paths :=<attribute label>| 

<object path>.<edge label>.<attribute path>| 

<object path>.<wildcard>.<attribute path>| 

<temporal path expressions := <edge label>.<object label> — > INTERVAL | 

<edge label> INTERVAL 

<temporal condition> ::= <temporal op> <value>| 

<interval opXselect path>.<temporal path expression>| 

<interval opXinterval values>| 

<temporal condition> AND<temporal condition>| 

<temporal condition> OR<temporal condition> 

<op>::= = | >|<|>= | =<)i =! = 

< temporal op>::= begin | end | begin before | end before | 
begin after | end after 

<interval op>::= precede | overlap | equal 

The intuitive meaning of the remaining nonterminals is the following: 

<label> a string 
<DB label> a database name 
<object label> a complex node name 
<attribute label> an atomic node name 
<edge label> an edge label 

<value> a constant string 
<interval values> a constant interval 



Figure 4.21. The WHERE clause of the CA-Fragment 



Definition 4.6.1. Let G = ( N,E,r ,£ ) be a semistructured temporal graph; 

we define the KTS ICq = (Eg, Acte, Hg,^g) over TIq as follows: 

— The set of atomic propositions IIg is the set of all the node labels of G. 

— The set of states is Eg is the set of nodes N. 

— The set of actions Acte includes all the edge labels. In order to capture the 
notion of before we also add to Acte actions for the inverse relations and 
for the negation of all the relations introduced. Moreover, we can assume, 
for each state s corresponding to a complex object with no outgoing edge in 
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E to another complex object, to add a self-loop edge labeled by the special 
action Q that is not in Acta- We recall that only complex objects represent 
site pages. 

— The transition relation 7 Zq contains all the pairs of nodes which are con- 
nected by means of a labeled edge (actually, edge labels are the actions 
defined in the previous point). 

— For the interpretation function Iq we can note that in each state the only 
formulae that hold are the atoms related to the labels and true. 

The queries we have shown in Section 4.4 can be respectively translated 
in the following CTL formulae. 

1. Find the book pages reached from John Grisham’s page with a click 
Book A EX Link -i (Author A EXH a sPro P erty(John Grisham)) 

This CTL formula is true in all Book states which are reachable by means 
of a click from an Author state which possesses a property John Grisham. 

2. Find the book pages reached from John Grisham’s page with some clicks 
Book A EF Link -i (Author A EXHasProperty(John Grisham)) 

This formula is very similar to the previous one, but the path from the 
Book state to the Author state has an arbitrarily length. 

3. Find the pages reached from John Grisham’s page with a click 
true A EX[_[ nk -i (Author A EXHasPro P erty(John Grisham)) 

This CTL formula is true in any state which is reachable by means of 
a click from an Author state which possesses the property John Grisham. 
Note that the atom true holds in each state. 

4. Find the users who visited John Grisham’s page as last page 
EGo(Author A EXHasPro P ert y (John Grisham)) 

This CTL formula is true in all Author leaf states labeled “John Grisham” 
and thus, with this formula we do not actually infer the exact name of 
the users satisfying this requirement (i.e. we do not deal with the UserlD 
property of the Author object). 

In Figure 4.23 we report the translation in the NuSMV format of Session 
B depicted in Fig. 4.16 (for readability reasons we do not include all atomic 
objects), whereas the translation of the four queries we have just proposed is 
in Figure 4.22: 

All the previous queries are contained in the portion of TSS-QL that 
can be solved in polynomial time on the size of the model and the CTL 
formula, by using a model-checker (see Section 4.6.2 for more details). We 
have called this portion of TSS-QL the “Path-Fragment” (i.e. P-Fragment in 
Section 4.4.1) because of the nature of information we can extract. 

Now we introduce some other queries expressible with the portion of TSS- 
QL which uses aggregation operators, called the “Complex Aggregation Frag- 
ment” (i.e. CA-Fragment) . Note that this fragment does not admit a trans- 
lation into CTL because the model-checker cannot elaborate any retrieved 
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SPEC 

l — Book & EX(f! = Author & 

EX(f! = HasProperty & EX(£ = JohnGrisham))) 
SPEC 

l — Book & EF(f! = Author & 

EX(F = HasProperty & EX(t! = JohnGrisham))) 
SPEC 

true & EX(t = Author & 

EX(f! = HasProperty & EX(£ = JohnGrisham))) 
SPEC 

EG (l = Author & 

EX(£ = HasProperty & EX(T = JohnGrisham))) 



Figure 4.22. NuSMV format of CTL queries 



MODULE main 

VAR 

s : {nl, n3, n8, nl4, nl5, nl6}; 

lab : {BookShop, IndAuthor, Author, Book, HasProperty, 
J.K. Rowling}; 

ASSIGN 

init(s) := nl; 
next(s) := case 
s = nl : n3; 
s = n3 : {n8, nl}; 
s = n8 : {nl4, nl5, n3}; 
s = nl4 : {nl4, n8}; 
s = nl5 : {nl6, n8}; 
s = nl6 : nl5; 
esac; 

DEFINE 

£ := case 

s = nl : BookShip; 
s = n3 : IndAuthor; 
s = n8 : Author; 
s = nl4 : Book; 
s = nl5 : HasProperty; 
s = nl6 : J.K. Rowling; 
esac; 



Figure 4.23. NuSMV format of a user session 



information. In order to solve with the model-checking based approach these 
queries, we should develop a system that can analyze the data previously 
extracted by a model-checker. 
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1. Find the favorite book page of a user 
SELECT Book. Name, MAX(clicks) 

FROM Global Interaction 

WHERE clicks = ANY ( SELECT COUNT(Book) 

FROM Global Interaction 
GROUP BY Book. Name) 

2. Find the number of times that a user clicks from the Index of Author to 
the J. K. Rowling ’s page 

SELECT COUNT(IndAuthor. Author) 

FROM Global Interaction 

WHERE Author. Name = “J. K. Rowling” 

3. Find the date when a user visited John Grisham’s page 
SELECT Author^INTERVAL 

FROM Merge 

WHERE Author. Name = “John Grisham” 

The keyword “INTERVAL” is used to extract the time interval of John 
Grisham’s page. 

4.6.2 Complexity Results on TSS-QL Fragments 

In this section we summarize the results on semantics presented in this paper, 
and discuss the main complexity results about TSS-QL fragments. 

The P-Fragment can be correctly translated into CTL , as depicted in 
Figure 4.24, because it includes only simple path queries without join con- 
ditions and operators applied on temporal elements (we remind that queries 
with paths including temporal edges are contained in this fragment). In- 
tuitively, in order to translate semistructured temporal graphs into Kripke 
Transition Systems, we have considered as local properties of each state the 
set of labels of the corresponding node in the semistructured graph. Moreover, 
the translation of a query into a CTL formula is obtained by using the modal 
operators X or F for expressing respectively paths of length one or paths of an 
arbitrarily length. The F operator of CTL allows direct calculation of tran- 
sitive closures that is used to gave semantics of queries containing wildcard. 
Thus, if we remain within the P-Fragment, by using the model-checker we 
can also compute queries containing wildcards. 

With the JT-Fragment we add 1) comparison between values and 2) 
general temporal operators to the language. 

1. Propositional temporal logics cannot express constraints that force two 
states with the same set of local properties to be a unique state. In 
fact, the modal logic semantics forces a bisimulation equivalence between 
subgraphs fulfilling the same formula. This allows us to say that two 
nodes with the same properties are equivalent but not to require that 
they be the same node. For this reason comparisons of values belonging 
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to different paths (i.e. “join-conditions”) do not have always an exact 
translation into CTL. 

2. The problem of general temporal conditions, i.e. conditions on the tem- 
poral element of an object or relationship in a semistructured temporal 
graph, is that model-checking provides an algorithm to find all the states 
which satisfy a formula and these states are characterized by the set of 
their local properties. Thus, a model-checker cannot extract and analyze 
only a subset of these properties, for example only the temporal element. 
In fact, with our approach we can correctly check temporal conditions 
with the equality operator, i.e. the model-checker can only tests if a given 
temporal element is a property of a state. 

In the same way, the CA-Fragment includes some queries without a 
translation into CTL. In this case the reason is that a model-checker cannot 
elaborate any previously retrieved information and thus, cannot deal with 
typical aggregation functions of SQL, such as MIN, MAX, and COUNT. 




Figure 4.24. Complexity of TSS-QL Fragments with respect to CTL 




5. Related Works 



In this thesis we have focused our attention on three languages for semistruc- 
tured data: G-Log, W, and TS-QL. We have deeply studied the relationships 
between them by considering their expressive power and by applying tech- 
niques coming from the model-checking field to efficiently solve the data 
retrieval problem. 



5.1 Semantics Aspects of Query Languages 

Several articles have pointed out the need for data models and query 
languages to integrate heterogeneous data sources and a lot of propos- 
als have emerged in the past years. We can cite LOREL [66], UnQL [12], 
GraphLog [33], G-Log [81], WG-Log [32], WebSQL [67], WebOQL [7] and, 
StruQL [50] . 

Among them, a graphical language that shows many similarities with G- 
Log is GraphLog [33], a declarative language with graph-based data model 
and rules. GraphLog is a deterministic language intended for the relational 
model, it is not Turing-complete, and allows no sequences of rules. Moreover, 
GraphLog query graphs are acyclic and the queries, which require patterns 
to be present in (or absent from) the database graph, are supposed to extend 
databases only with new edges (i.e. queries can only define new relations). 
Conversely, G-Log was originally developed as a language and data model 
for complex objects with identity [1], and in its full form it is Turing complete 
(see [83]). The structure and the meaning of queries in the two languages are 
rather similar, but cycles are allowed in G-Log, and G-Log rules enable the 
user to extend the databases both with entities and relations. 

Similarities can also be found between our approach to give semantics for 
graphical languages and previous works on UnQL [12], where the notion of 
bisimulation is used for investigating query decomposition. However, differ- 
ences between G-Log and UnQL are quite deep. For instance, when assigning 
semantics to the language basic blocks, we allow information to be located 
in graph nodes, while UnQL locates information on edges; more importantly, 
G-Log queries are written directly in the graph formalism, while UnQL de- 
scribes data instances graphically, and the query language of UnQL is SQL- 
like. Moreover, G-Log allows to express cyclic information and queries, and 
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achieves its high expressive power by allowing a fully user-controlled non- 
determinism. 

Anyway, keeping in mind these differences, the results of the work shown 
in this thesis about the bisimulation-based semantics, can be also applied to 
GraphLog and UnQL, and in general to any graphical language whose main 
aim is to query and transform a graph-based data model by using graph-based 
rules. 



5.2 Efficient Query Retrieval 

In the semistructured database field we find also many attempts to give 
a graph-based representation of first-order logic or to translate graphical 
queries into logical formulae (see for example [81, 33, 10, 61]). More recently, 
other proposals have studied the relationships between databases and modal 
logics (see [20, 16, 75, 76, 43]), but none of them attack the problem of effi- 
ciently solve queries with model-checking algorithms. 

A work which introduces temporal-logic queries for model understanding 
and model checking is [20]. The author proposes an approach for inferring 
properties of models as a special case of evaluating temporal-logic queries 
by using a model checker. In [20] a query is a CTL formula with a special 
symbol “?’, called placeholder, which appears exactly once. Given a model 
M, the semantics of a query is a proposition p such that replacing “?’ with 
p in the query, it results a formula that holds in M and is as strong as 
possible. Moreover, the author defines the subset of CTL queries that are 
guaranteed to be evaluated in time linear in the size of the model and in the 
length of the query. This approach could be used in the semistructured data 
field to understand properties of semistructured instances and to iteratively 
gain knowledge about them. In fact, it is well known that semistructured 
data [2] has some structure but it is often irregular and thus, an efficient 
approach based on model checking techniques could be applied in order to 
derive a possible representation, or an approximation of the structure, of a 
data source for driving users in the query formulation process. 

Consider for example the W-instance of Fig. 3.2. The evaluation of the 
query AG^ ct ? on that W-graplr, gives as result p = Teacher V Course V 
DatabasesVStudentVSmithV40V37, i.e. an indication of all the labels included 
in the model of Fig. 3.2. Whereas the evaluation of the query Teacher A EX.^? 
on the same model gives as result all the successors of Teacher states, i.e. 
p = Course V 40 V 37. 

In [16] the authors define a query language for semistructured data that 
is based on the ambient logic [17], which is a modal logic originally proposed 
to describe the structural and computational properties of distributed and 
mobile computations. The ambient logic is designed to describe properties 
of labeled trees and thus, it is suitable to be used like a query language for 
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semistructured data. The proposed query language, called Tree Query Lan- 
guage (TQL), is characterized by the fact that a matching expression is a 
logic expression combining matching and logical operators. It would be in- 
teresting to compare TQL with the temporal logic CTL in term of expressive 
power and to show the possibility to apply polynomial model-checking algo- 
rithms also to this SQL-like query language, which is essentially based on a 
modal logic. 

In [75] the authors suggest to query data using ‘query automata’, that 
are particular cases of two-ways deterministic tree automata. They show the 
equivalence with formula expressible in second-order monadic logic. In [76] 
they identifies subclasses of formulae/automata that allow efficient algo- 
rithms. A future work is to determine the exact relationships with our ap- 
proach. 

An attempt to apply model-checking in the WWW field is introduced 
in [43] , where model-checking is used for the analysis of the World Wide Web, 
i.e. to help with the detection of errors in the structure and connectivity of 
Web pages. In this context, the model (a Kripke Structure) of the Web is 
not known in advance, but can only be gradually explored by using model- 
checking techniques. In particular, in [43] a constructive /./-calculus is defined 
in order to effectively apply algorithms for the analysis of the Web, where 
some operations of the ordinary //-calculus [49], such as the computation of 
sets of predecessor Web pages and the computation of the greatest fixpoint, 
are not possible because of the graph structure of the Web itself. 



5.3 Temporal Models and Query Languages for 
Semistructured Data 

The possibility to use an efficient algorithm for the CTL temporal logic in 
order to solve static queries on semistructured databases has increased our 
interest in studying real temporal properties of semistructured data as well. 

In the past years temporal database management was a promising field of 
research that is now covered in textbooks (see for example [99, 3, 48]) and sev- 
eral data models, which usually extended relational data models [29, 74, 86], 
were proposed in order to consider the time-varying nature of data. How- 
ever, relational data models [29, 74, 86] are flat and thus unsuitable for 
semistructured data. In this thesis we have introduced a general tempo- 
ral data model based on labeled graphs, which is built upon concepts from 
temporal databases and semistructured data models, to store time-varying 
semistructured information. 

Other works on temporal aspects for semistructured data are [15, 6, 47, 

21 , 22 ], 

In [15] the authors describe a data warehousing system for managing se- 
lected useful historical web information. To support temporal queries and 
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data analysis they define a temporal web data model which is able to cap- 
ture web documents together with their updating history into temporal web 
tables. The graph-based model proposed in [15] is designed to support the rep- 
resentation of Web sites and is not general enough to be applied to semistruc- 
tured databases: each node in the graph denotes a web page, and each directed 
link denotes a hyperlink between two web pages. Moreover, in order to keep 
trace of historical information the authors attach a temporal attribute to each 
node to denote the valid time interval of its web document instances. The 
valid time is an interval enclosed by two time points, called last-modified time 
and downloading time. This interval indicates the period within which the web 
document captured by the node remains unchanged. The last-modified time 
indicates the latest time a web object is changed while the downloading time 
represents the date and time at which a web page departs the HTTP server 
from which the web document is accessed. In our opinion, this time interval 
does not represent the valid time of a given web object, because it is not 
connected to the real content of the web page. Moreover, the time interval 
does not represent the transaction time of a web object in a site and seems 
to be the transaction time of the system developed to create the Warehouse. 

Another work which focuses on the definition and the implementation 
of a logical data model for representing histories of XML documents is [6]. 
The proposed model is an extension of the XPath [95] data model with a 
label on edges regarding the valid time concept. In particular, the model 
is used for XML documents and thus, edges represent only the containment 
relation between Web pages or portion of pages. Our idea is to study temporal 
properties on a more general temporal model suitable to represent generic 
semistructured data (not only Web sites). 

In [47] the authors propose a semistructured model where annotations on 
edge labels allow to record information about updates. In particular, they 
consider annotations about transaction time, valid time and kind of update. 
They also define temporal operators supporting coalescing, collapsing and 
slicing operations. In the proposed model, like in the OEM one [80], the edge 
labels do not actually represent relationships between objects, they are used 
to make nodes self-describing in the sense that a node is characterized by the 
sequences of labels on paths through the graph that lead to the node itself. 
Once again, we choose to consider a more general and expressive model based 
on labeled graphs where both edges and nodes have properties (expressed by 
means of labels) . In this way we can also consider semantic relationships be- 
tween nodes connected by paths; consequently information about transaction 
time have to been added to both nodes and edges. 

Another two works which address the problem of representing and query- 
ing changes in semistructured data are [21, 22]. The context taken into con- 
sideration in these works is similar to ours and thus we deeply describe the 
main differences and the advantages of our model. 
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5.3.1 Comparison with the DOEM Model 

In [21, 22] the authors propose a model based on the Object Exchange Model 
(OEM) (see [80]), a simple graph-based data model, with objects as nodes 
and object-subobject relationships represented as labeled arcs. Change op- 
erations (i.e. node insertion, update of node values, addition and removal of 
labeled arcs) are represented in DOEM (for Delta-OEM) by using annota- 
tions on the nodes and arcs of an OEM graph. Intuitively, annotations are the 
representation of the history of nodes and edges. To implement DOEM and 
Clrorel they use a method that encodes DOEM databases as OEM databases 
and translates Clrorel queries into equivalent Lorel queries over the OEM en- 
coding. With this implementation they allow annotations on a DOEM graph 
D to be attached to the nodes and edges of a OEM database Od- In this 
way, for each object o of D there is a complex object o' in Od with some sub- 
objects, which are required in order to represent the history of the node o, its 
old values or relationships with other nodes. This method causes a growth in 
the number of nodes of the final graph over which it is possible to apply the 
query. 

The main aim of our approach is to overcome this problem: on one hand, 
we adopt a model which reduces the amount of annotations needed for in- 
formation representation; on the other hand, we plan to effectively apply 
model-checking techniques as a way to solve queries which are able to take 
into account static and dynamic aspects of semistructured data. When the 
properties belong to some classes of formulae the problem of finding if a model 
possesses a given property is decidable and algorithms running in linear time 
on both the sizes of the model and the formula can be employed [27] . Conse- 
quently it is important to keep the size of semistructured databases as small 
as possible. 

For example, in Fig. 5.1 we show the DOEM graph corresponding to a 
portion of the instance in Fig. 4.1 of Chapter 4: note that only edges are 
labeled with the name of destination objects. Actually, in both OEM and 
DOEM edges represent containment relations between objects, and do not 
allow to indicate different semantic relationships between objects, for this 
reason our model has a higher expressive power. Annotations are represented 
in DOEM by rectangles containing the name of the change operation and the 
date when the modification occurred. 

In Fig. 5.2 we report the OEM database corresponding to DOEM graph 
in Fig. 5.1. Note that in this OEM graph annotations have been encoded 
into complex objects storing the history of change operations by means of 
connected sub-objects. To highlight the difference between our temporal data 
model and DOEM we consider an update operation to an atomic object, 
because it clearly distinguishes the two approaches by considering the number 
of added nodes. For example, to perform the update on the Title of the Course 
previously called Data Bases, into Web Data Bases in the semistructured 
graph of Fig. 4.1, we only add an atomic object with the new name, whereas in 
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Figure 5.1. A DOEM semistructured database 



DOEM the number of added nodes is shown in the dashed region of Fig. 5.2. 
Note that we add the minimum piece of information to keep trace of the 
update. 

Our graphical model allows to represent and retrieve temporal information 
about semistructured data according to at least two interpretations of the 
time concept that are Valid Time and Interaction Time. The former notion 
is widely used in classic temporal database field, whereas the latter is related 
to the time a user spends while visiting Web sites. 

Monitoring and analyzing how the Web is used is nowadays an active area 
of research in both the academic and commercial worlds. Web Usage Mining 
or Analysis field can be defined as being interested in patterns of behavior 
for Web users. 

In particular, personalizing the Web experience for a user is a crucial task 
of many Web-based applications such as e-commerce applications. In fact, 
making dynamic and personalized recommendations to a Web user, based 
on their profile and not only on general usage behavior, is very attractive in 
many applications. Web Usage Mining is an excellent approach for achieving 
this goal, as shown in [92] . 

Some existing systems, such as Web Watcher [58], Letizia [63], WebPer- 
sonalizer [72], have all concentrated on providing Web Site personalization 
based on usage information. WebWatcher [58] “observes” a user as they 
browse the Web and identifies links that are potentially interesting to the 
user. Letizia [63] is a client side agent that searches for pages similar to ones 
that the user has already visited. The WebPersonalizer [72] page recommen- 
dations are based on clusters of pages found from the server log for a site. The 
system recommends pages from clusters that most closely match the current 



session. 



5.3 Temporal Models and Query Languages for Semistructured Data 125 




Figure 5.2. A OEM semistructured database 



Web Usage Mining techniques introduced in [23, 35, 98, 91] have been 
used for discovering frequent item sets, association rules, clusters of similar 
pages and users, sequential patterns, and to perform path analysis. 

As we said before, analysis of how users are visiting a site is crucial for 
optimizing the structure of the Web site and thus, there are a lot of reasons in 
pre-processing data for mining before running the mining algorithms in order 
to address some important issues (see for example [34]). These include de- 
veloping a model of access log data and developing techniques to clean/filter 
the raw data in order to eliminate outliners and/or irrelevant items. Accord- 
ing to [36] a data model can be constructed by considering the following 
information: 

— Content: the real data stored in the Web pages. 

— Structure: data which describe the organization of the content. 

— Usage: data that describe the usage pattern of Web pages, such as, for 
example, page references, time and date of accesses. 

— User Profile: data that provide personal information about users of the 
Web site. 

Moreover, before any mining is done on Web usage data, sequences of 
page references must be grouped into logical units representing user sessions. 
A user session is the set of page references made by a user during a single visit 
to a site and is the smallest transaction to consider for grouping interesting 
page references as proposed in [23, 35]. 
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In the context of pre-processing tasks, our proposal is to use a database 
approach based on our graphical data model to store user interactions and to 
apply either a SQL-like query language or the model-checking based approach 
to retrieve information of interest about user navigation history. As a conse- 
quence, the possibility to apply the same model in different contexts allows 
us to assert that a generic temporal model (i.e. a model that is not strictly 
application-dependent) may be useful to represent more interpretations of 
the “time concept”. 




6. Conclusion 



In this work we have proposed a method to provide suitable graph-based se- 
mantics to languages for semistructured data, supporting both data structure 
variability and topological similarities between queries and document struc- 
tures. A suite of operational semantics based on the notion of bisimulation 
has been introduced both at the concrete level (instances) and at the abstract 
level (schemata). 

In the context of semistructured and WWW data efficiency is a primary 
requirement and thus, complexity results on subgraph-bisimulation problem 
have stimulated our interest in investigating alternative approaches to solve 
(graphical) queries on databases that do not present a rigid structure. The 
main idea of the work is to effectively solve the data retrieval problem for 
semistructured data by using techniques and algorithms coming from the 
model-checking field. In particular, we have considered three languages for 
semistructured data: G-Log, W, and TS-QL. We have first studied G-Log, 
because it is a general language that offers the expressive power of logic, 
the modeling power of object-oriented DBs, and the representation power of 
graphs for drawing instances, schemata and graphs. W is more or less the 
subset of G-Log that has a natural translation into the modal logic setting. 
Finally, in order to express temporal queries we have introduced a SQL-like 
language that allows to express complex conditions on the time- varying na- 
ture of semistructured data. However, we have shown that it is quite natural 
to extend graphical languages (e.g. the G-Log language) to express temporal 
properties with labeled graphs. 

In the recent years a lot of attention has been given by the research com- 
munity to the study of methods for representing and querying semistructured 
data. In our opinion, the approach we proposed in this work presents several 
benefits: 

— first of all, we benefit from the cross-fertilization of methods originally 
designed in quite different research areas (Databases, Logics, and Model- 
Checking) and with our proposal we try to establish a relationship between 
them. 

— Moreover, from a formal point of view, the work is also interesting for clas- 
sifying polynomial subclasses of the subgraph isomorphism/bisimulation 
problems. 
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— The approach based on model-checking algorithms is general enough to be 
applied not only to our graphical language, but to the main existing lan- 
guages for semistructured data as well, and thus, can be further developed 
to effectively solve the data retrieval problem. 

— The nature of the CTL logic has allowed us to extend our main idea to the 
temporal database setting. First of all, we have shown that a general graph- 
based model can be used to represent both static and dynamic aspects 
of semistructured data. Moreover, we have applied temporal formulae to 
express real temporal queries and model-checking to solve them. 

— To conclude, we have concentrated our attention not only on the classical 
notion of the “time concept” , as it is deeply known in the database field, 
but we have further investigated on the possibility to consider more than 
one dynamic aspects of semistructured data. 

The subjects of this work represent and introductory study from which 

many extensions in different directions may start. In particular, we plan to 

investigate on the following aspects: 

— The graph-formula translation based on model-checking techniques could 
be extended to a wider family of graphs including join conditions. It might 
also be interesting to find the exact relationships between the set of CTL* 
formulae and the set of formulae needed for cyclic queries. 

— We have implemented the method on the model-checker NuSMV. However, 
for an effective implementation on WWW data some form of heuristic (e.g. 
check part of the domain name) must be applied in order to cut the search 
space for accessible data (that can be all the web). This issue is studied 
in [43] in a slight different context. 

— Another future direction is to mix this technique with constraint-based 
data retrieval (where temporal formulae can be used directly). 

— In the temporal database field, we plan to provide a formal semantics 
for our SQL-like query language and to extend it in order to include the 
possibility to express updates. Moreover, we will investigate on the ex- 
act translation between the SQL-like version of TS-QL and its graphical 
counterpart, by considering the expressive power and limits of the two 
approaches. 

— We plan also to investigate on the problem of using the temporal sequence 
of all navigation actions taken by a user, as an input to a site personal- 
ization tool. Another aim of our database approach is to find techniques, 
coming from the Web Mining field, to use previous history to reduce the 
amount of browsing actions needed to achieve a user’s search goal. 
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